Population-Based Incremental Learning:

A  Method  for  Integrating  Genetic  Search  Based  Function 
Optimization  and  Competitive  Learning 


Abstract: 

Genetic algorithms (GAs) are biologically motivated adaptive systems which have been used,
with varying degrees of success, for function optimization. In this study, an abstraction of the
basic genetic algorithm, the Equilibrium Genetic Algorithm (EGA), and the GA in turn, are
reconsidered within the framework of competitive learning. This new perspective reveals a
number of different possibilities for performance improvements. This paper explores
population-based incremental learning (PBIL), a method of combining the mechanisms of a
generational genetic algorithm with simple competitive learning. The combination of these two
methods reveals a tool which is far simpler than a GA, and which out-performs a GA on a large
set of optimization problems in terms of both speed and accuracy. This paper presents an
empirical analysis of where the proposed technique will outperform genetic algorithms, and
describes a class of problems in which a genetic algorithm may be able to perform better.
Extensions to this algorithm are discussed and analyzed. PBIL and extensions are compared
with a standard GA on twelve problems, including standard numerical optimization functions,
traditional GA test suite problems, and NP-Complete problems.


1.  Introduction 

Genetic algorithms and artificial neural networks are two recently developed techniques that have
received a large amount of attention in the artificial intelligence research communities. Many of the current
efforts in genetic algorithm research have stressed their role as general purpose function optimizers, which
have been applied to functions in the fields of biological modelling, NP-complete problem approximation
heuristics and standard numerical optimization. A large part of the research effort in the field of neural
networks and competitive learning has been devoted to clustering and classification tasks, with a wide
variety of applications, such as feature mapping, speech recognition [Waibel, 1989], scene analysis and
robot control [Pomerleau, 1989] and data compression [Hertz, Krogh, Palmer, 1993]. There are also many
continuing efforts to perform classification and concept learning tasks using genetic algorithms [De Jong et
al., 1993][Janikow, 1993][Booker et al., 1990][Goldberg, 1989]. Similarly, optimization using artificial neural
networks is a widely researched area. Perhaps the most widely known attempt is that of the Hopfield and
Tank Traveling Salesman model [Hopfield & Tank, 1985][Hertz, Krogh, Palmer, 1993].

An abstraction of the basic GA, termed the "Equilibrium Genetic Algorithm" (EGA), has been developed
in conjunction with [Juels, 1993, 1994]. The EGA attempts to describe the limit population of a genetic
algorithm by an equilibrium point, under the assumption that the population is always "mated to
convergence." This process can be viewed as a form of eliminating the explicit crossover step in standard genetic
algorithm search. In the work with Juels, the representation of the population by using an equilibrium
point was explored. The work presented in this paper concentrates on the operators needed to make this
new representation effective for search; the operators are presented in the context of supervised
competitive learning. Issues regarding the use of the resulting algorithm, such as the settings of the parameters, the
contribution of a mutation operator as well as the algorithm's relation to more traditional genetic search,
are examined. It is determined that the framework of competitive learning provides a basis from which to
explain the performance of the EGA. The new framework also provides the mechanisms for many
extensions to the basic algorithm which improve the performance of the EGA.¹ Several of the extensions are
explored in this paper, and many more are presented as directions for future research.

This paper presents population-based incremental learning (PBIL), a method of combining genetic
algorithms and competitive learning for function optimization. PBIL is an extension to the EGA algorithm
achieved through the re-examination of the performance of the EGA in terms of competitive learning. This
paper also presents a detailed empirical examination of the EGA and PBIL algorithm on a suite of
problems designed to test their abilities in a wide variety of function optimization domains. Although this
work has ties with other areas of research involving genetic algorithms, to limit the scope of this paper,
PBIL is explored exclusively in the context of function optimization. Other recent work in combining GAs
with techniques related to artificial neural networks has been largely directed in alternative directions,
such as using GAs for optimizing neural network structure and performance [Whitley & Schaffer,
1992][Gruau, 1993][Polani & Uthmann, 1993][Whitley & Hanson, 1989]. One of the closest related works,
which also employed an underlying artificial neural network structure to augment genetic search, can be
found in Ackley's A Connectionist Machine for Genetic Hillclimbing [Ackley, 1987][Ackley, 1985].

The remainder of this section describes the principles of genetic based optimization and competitive
learning. Section 2 describes the combination of both of the techniques by examining the role of the population
in a genetic algorithm. Section 3 examines the effects of changing some of the parameters in the PBIL
algorithm. Section 4 presents extensions of the algorithm, based on Kohonen's learning vector quantization
algorithm [Kohonen, 1988, 1989]. In section 5, the basic PBIL and its extensions are empirically examined
on a suite of twelve test problems. Section 6 discusses the applicability of the results reported in section 5
and concludes with a discussion of the benefits of using PBIL in comparison to standard genetic
algorithms for function optimization. Finally, section 7 closes this report with suggestions for future research.


1.1.  Introduction  to  Genetic  Algorithms 

Genetic algorithms (GAs) are biologically motivated adaptive systems which are based upon the
principles of natural selection and genetic recombination. Although GAs have often been used for function
optimization, some of the pioneering work [Holland, 1975] suggests an alternative view of GA behavior. De
Jong summarizes this view in the following paragraph:

"Holland [Holland, 1975] has suggested that a better way to view GAs is in terms of
optimizing a sequential decision process involving uncertainty in the form of lack of a priori
knowledge, noisy feedback and a time-varying payoff function. More specifically, given a limited
number of trials, how should one allocate them so as to maximize the cumulative payoff in the
face of all this uncertainty?" [De Jong, 1992].

This has important implications for the use of genetic algorithms as function optimizers. Although genetic
algorithms may have the ability to quickly find regions of high performance in the face of noise and
time-varying payoff functions, they may be unable to find the absolute optimal solution, in time-varying or
stationary environments.

1. To avoid confusion, the term "EGA" will be used only when referencing the [Juels, 1993] explanation of the algorithm.

The GAs described in the current literature are highly modified to reconcile the differences between
function optimization, in which the effectiveness is measured by the best solution found, and maximizing
cumulative returns, in which finding the absolute best solution may not be the only critical issue. The basic
genetic algorithm and some of the common modifications which have been made for effective function
optimization are described below.

A GA combines the principles of survival of the fittest with a randomized information exchange. Although
the information exchange is randomized, the GA is far different from a simple random walk. A GA has the
ability to recognize trends toward optimal solutions, and exploit such information by guiding the search
toward them.

A genetic algorithm maintains a population of potential solutions to the objective function being
optimized. The initial group of potential solutions is determined randomly. These potential solutions, called
"chromosomes," are allowed to evolve over a number of generations. At every generation, the fitness of
each chromosome is calculated. The fitness is a measure of how well the potential solution optimizes the
objective function. The subsequent generation is created through a process of selection and recombination.
The chromosomes are probabilistically chosen for recombination based upon their fitness; this is a measure
of how well the chromosomes achieve the desired goal (e.g. find the minimum in a specified function, etc.).
The recombination operator combines the information contained within pairs of selected "parents", and
places a mixture of the information from both parents into a member of the subsequent generation.
Selection and recombination are the mechanisms through which the population "evolves." Although the
chromosomes with high fitness values will have a higher probability of being selected for recombination than
those which do not, they are not guaranteed to appear in the next generation. The "children"
chromosomes produced by the genetic recombination are not necessarily better than their "parent" chromosomes.
Nevertheless, because of the selective pressure applied through a number of generations, the overall trend
is towards better chromosomes.
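
The selection and recombination steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: it assumes binary chromosomes stored as Python lists, roulette-wheel (fitness proportional) selection, one-point crossover, and a hypothetical objective named `count_ones`. None of these names come from the original text.

```python
import random

def select_parent(population, fitnesses, rng):
    """Fitness-proportional (roulette-wheel) selection of one chromosome."""
    return rng.choices(population, weights=fitnesses, k=1)[0]

def one_point_crossover(parent_a, parent_b, rng):
    """Child takes a prefix from one parent and the suffix from the other."""
    point = rng.randint(1, len(parent_a) - 1)
    return parent_a[:point] + parent_b[point:]

def next_generation(population, fitness_fn, rng):
    """Build a full replacement population by selection and recombination."""
    fitnesses = [fitness_fn(c) for c in population]
    children = []
    for _ in range(len(population)):
        a = select_parent(population, fitnesses, rng)
        b = select_parent(population, fitnesses, rng)
        children.append(one_point_crossover(a, b, rng))
    return children

def count_ones(chromosome):
    # Hypothetical objective; the +1 keeps every selection weight positive.
    return sum(chromosome) + 1

rng = random.Random(0)
population = [[rng.randint(0, 1) for _ in range(16)] for _ in range(30)]
for _ in range(40):
    population = next_generation(population, count_ones, rng)
best = max(population, key=count_ones)
```

Note that nothing in this sketch guarantees that a highly fit parent survives into the next generation; selection only biases the sampling.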

In order to perform extensive search, genetic diversity must be maintained. When diversity is lost, it is
possible for the GA to settle into a suboptimal state. There are two fundamental mechanisms which the
basic GA uses to maintain diversity. The first, mentioned above, is a probabilistic scheme for selecting
which chromosomes to recombine. This ensures that information other than that represented in the best
chromosomes appears in the subsequent generation. Recombining only good chromosomes will very
quickly converge the population without extensive exploration, thereby increasing the possibility of
finding only a local optimum. The second mechanism is mutation; mutations are used to help preserve
diversity and to escape from local optima. Mutations introduce random changes into the population.
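
As a small illustration of the second mechanism, the sketch below applies bit-flip mutation to a fully converged population. The function name `mutate` and the rate of 0.05 are assumptions chosen for the example; the text does not prescribe a particular mutation operator or rate.

```python
import random

def mutate(chromosome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in chromosome]

rng = random.Random(1)
# A fully converged population: crossover alone can never produce a 1 here.
converged = [[0] * 12 for _ in range(8)]
mutated = [mutate(c, 0.05, rng) for c in converged]
```

Because every member of `converged` is identical, recombination can only reproduce the same string; mutation is the only operator in this sketch that can reintroduce a lost value.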

The genetic algorithm is typically allowed to continue for a fixed number of generations. At the conclusion
of the specified number of generations, the best chromosome in the final population, or the best
chromosome ever found, is returned. Unlike the majority of other search heuristics, genetic algorithms do not
work from a single point in the function space. GAs continually maintain a population of points from
which the function space is explored.

In using GAs for function optimization, many issues, such as proper scaling of functions, ensuring that
good information is not lost due to random chance, and efficient problem representation, need to be
resolved. Although GAs can often find regions of high performance, it is much harder for the GA to move
to the optimal solution. One potential reason for this inability is that the differential between good and
optimal solutions may be very small in comparison with the differential between good and bad solutions.
In designing an effective genetic based function optimizer, it is necessary to be able to provide a large
enough "incentive" for the GA to make progress given only small differentials. One method of
maintaining a large differential between potential solutions is to employ a method of dynamic scaling, so that the
fitness of each solution is measured relative to the fitness of the other solutions in the current population.

As a genetic algorithm is randomized, it is possible to lose the best solution due to random chance. There is
no guarantee that the best solution in the current population will be selected for recombination, or that if it
is selected, that the mutation and crossover operators will not destroy some of its information as it is
passed to its successors. If the best solution is lost from the population, there is no guarantee that the
solution will be found again. Methods such as elitist strategies have been proposed to address this problem.
Elitist strategies ensure that the best solution in the previous population is transferred to the current
generation unaltered. A common implementation of this replaces the worst chromosome in the current
generation with the best from the previous generation.
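
The common elitist implementation described above can be sketched in a few lines. The function name `apply_elitism` and the use of `sum` as a stand-in fitness function are assumptions made for illustration only.

```python
def apply_elitism(previous, current, fitness_fn):
    """Replace the worst member of `current` with the best of `previous`."""
    best_previous = max(previous, key=fitness_fn)
    worst_index = min(range(len(current)), key=lambda i: fitness_fn(current[i]))
    result = list(current)
    result[worst_index] = best_previous
    return result

previous = [[1, 1, 1], [0, 0, 0]]
current = [[0, 0, 1], [0, 0, 0], [0, 1, 1]]
survivors = apply_elitism(previous, current, sum)
```

With this step in place, the best evaluation in the population can never decrease from one generation to the next.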

Beyond the mechanisms inherent to the GA which influence its ability to optimize functions, there is also
the issue of problem representation. Although the majority of GA research has been conducted using a
binary solution representation, this is not the only method of encoding problem solutions. Different
methods which alter the cardinality of the alphabet used for encoding, or the interpretation of the encoding, can
have an enormous impact on the performance of the GA [Liepins & Vose, 1990]. A good overview of the
issues and difficulties involved in using GAs for function optimization can be found in [De Jong, 1992].


1.2.  Introduction  to  Competitive  Learning 

Competitive learning (CL) is often used to cluster a number of unlabeled points into distinct groups.
Membership in each group is based upon the similarity of points with respect to the characteristics under study.
The procedure is unsupervised, as there is no a priori knowledge of how many groups exist, or what each
group's distinguishing features may be. The hope is that the CL procedure will be able to determine the
most relevant features for class formation and then be able to cluster points into distinct groups based on
these features.

Competitive learning is often studied in the context of artificial neural networks as it is easily modeled in
this form. A typical competitive learning network is shown in Figure 1. The inputs correspond to the
feature vector for each point. The outputs correspond to the class in which the network has placed the point.
In this network, there are two types of connections, excitatory and inhibitory. The inhibitory connections,
between output units, ensure that only one output is turned on at a time. The output unit that is turned on
is the one which has the largest net input. The excitatory connections contribute to the net input of the
outputs. The algorithm used to train the network is described below.


Figure 1: A competitive learning network. The hollow arrows represent inhibitory
connections, the filled arrows represent excitatory connections.

The initial weights of the network are chosen randomly, and are subject to normalization constraints,
which are detailed in [Hertz, Krogh & Palmer, 1993]. The activation of the output units is calculated by the
following formula (in which $w_{ij}$ is the weight of the connection between i and j):

$$\text{output}_i = \sum_j w_{ij} \times \text{input}_j$$

In a competitive learning network, only the output unit with the largest activation is allowed to fire for
each point presented. The winning output unit corresponds to the classification of the input point. During
training, the weights of the winning output unit are moved closer to the presented point by adjusting the
weights according to the following rule (LR is the learning rate parameter):

$$\Delta w_{ij} = LR \times (\text{input}_j - w_{ij})$$

The process of training the CL network involves repeatedly presenting each point to the network until the
network has stabilized. Although, in general, the network cannot be guaranteed to stabilize if weight
updates are made after each point is presented to the network, other heuristics may be used to determine
when to stop training. One possible method of ensuring stability is to reduce the learning rate gradually to
0, as the total number of pattern presentations increases [Hertz, Krogh, Palmer, 1993].
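
The activation rule, the winner-take-all weight update, and the gradual reduction of the learning rate can be sketched together as follows. This is an illustrative sketch only: the linear decay schedule, the function names, and the toy data are assumptions, and the normalization constraints on the weights mentioned above are omitted for brevity.

```python
import random

def activations(weights, point):
    """output_i = sum_j w_ij * input_j for every output unit i."""
    return [sum(w * x for w, x in zip(row, point)) for row in weights]

def train_step(weights, point, lr):
    """Move only the winning unit's weights toward the presented point."""
    outputs = activations(weights, point)
    winner = max(range(len(weights)), key=lambda i: outputs[i])
    weights[winner] = [w + lr * (x - w) for w, x in zip(weights[winner], point)]
    return winner

def train(weights, points, epochs, lr_start):
    """Present every point each epoch while the learning rate decays toward 0."""
    for epoch in range(epochs):
        lr = lr_start * (1 - epoch / epochs)  # assumed linear schedule
        for point in points:
            train_step(weights, point, lr)
    return weights

rng = random.Random(0)
points = [[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.9]]
weights = [[rng.random(), rng.random()] for _ in range(2)]
train(weights, points, epochs=50, lr_start=0.5)
```

Each row of `weights` plays the role of one output unit's weight vector; only the row belonging to the winning unit changes when a point is presented.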

After the network training is complete, the weight vectors for each of the output units can be considered
prototype vectors for one of the discovered classes. The attributes with the large weights are the defining
characteristics of the class represented by the output. It is the notion of creating a prototype vector which
will be central to the discussions of PBIL. The unsupervised nature of the algorithm will not be
maintained, as the members of the class of interest will be easily determinable; this will be returned to in much
greater detail in the next section. Although the method of training described in this section has been
unsupervised, supervised competitive learning has been explored in artificial neural network literature, and is
central to the discussion in Section 2 [Kohonen et al., 1988] [Kohonen, 1989]. The examination of PBIL
through the perspective of supervised competitive learning yields insights which can lead to a much more
efficient algorithm than the base-line PBIL described in the next section. Improved versions of PBIL are
described in section 4, and are empirically examined in section 5.


2.  Examining  the  Genetic  Algorithm:  The  Role  of  a  Population 

One of the fundamental attributes of the genetic algorithm is its ability to search the function space from
multiple points in parallel. In this context, parallelism does not refer to the ability to parallelize the
implementation of genetic algorithms; rather, it refers to the ability to represent a very large number of potential
solutions in the population of a single generation. This section describes the implicit parallelism of GAs,
and explicit methods of ensuring parallelism. Because of the failure of the implicit parallelism to exist in
the latter parts of genetic search in the simple genetic algorithm, the utility of maintaining multiple points,
through the use of a population, decreases. The limited effectiveness of the population in the latter
portions of search allows it to be modeled by a probability vector, specifying the probability of each position
containing a particular value. This concept is central to the PBIL algorithm, and is the focus of this section.


2.1.  Implicit  and  Explicit  Parallelism  in  Genetic  Search 

In every generation during the run of a GA, a population of potential solutions exists. The search of the
function space progresses from these points, and the schemata represented in these points, in parallel. This
is in contrast to other search techniques, such as simulated annealing [Kirkpatrick et al., 1983], or
hillclimbing, in which a single point in the function space is used as the basis for search. The ability to search
multiple schemata in each solution vector has been termed implicit parallelism. However, useful
parallelism, at the level of the population, is not easily maintained. It is possible for the population to converge to
very similar solution vectors. Once the population has converged, the ability for crossover operators to aid
in exploring new portions of the function space is greatly hindered. For a review of typical crossover
operators see Appendix A. Premature convergence of a population can occur when the population becomes too
homogeneous. As the GA allocates an exponentially increasing number of trials to better solutions
[Goldberg, 1989], the entire population may come to be dominated by very similar solution vectors when several
consecutive generations do not develop novel high evaluation solution vectors.

In a traditional GA, the problem of premature convergence and the trap of local minima has been partially
addressed by the mutation operator. The mutation operator inserts random changes into the population,
without regard to whether the effects are beneficial or detrimental. However, other mechanisms, which
help to maintain the parallelism explicitly, have been proposed to address the problem of diversity loss
and maintaining parallelism in search [Eschelman, 1991] [Goldberg, 1989] [De Jong, 1975] [Goldberg,
1987]. To demonstrate the importance of maintaining parallelism in genetic search, the technique
presented here to implement explicit parallelism is subpopulation evolution.

One method of implementing explicit parallelism is through models of genetic algorithms often referred to
as "island models" or "coarse/fine grain parallel GAs" etc. The underlying premise of these models is that
although genetic search often loses the parallelism inherent in a single large population structure, it is
possible to maintain parallelism through the use of multiple subpopulations. In this model, the single large
population of the traditional genetic algorithm is divided into many smaller subpopulations. Each
subpopulation evolves its chromosomes primarily independently of the other subpopulations. Interaction, in the
form of chromosome swapping between subpopulations, is restricted, and is based upon temporal and
spatial considerations [Whitley, 1991] [Whitley, 1992] [Baluja, 1992] [Muhlenbien, 1989] [Schleuter, 1990].
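
One possible form of restricted interaction is sketched below. The ring topology and the policy of copying each island's best chromosome over its neighbour's worst are assumptions chosen for illustration; the cited island models differ in how, when, and between which subpopulations chromosomes are exchanged.

```python
def migrate_ring(islands, fitness_fn):
    """Copy each island's best chromosome over the worst member of the next island."""
    bests = [max(island, key=fitness_fn) for island in islands]
    for k, island in enumerate(islands):
        incoming = bests[k - 1]  # left neighbour; index -1 wraps to the last island
        worst = min(range(len(island)), key=lambda i: fitness_fn(island[i]))
        island[worst] = list(incoming)
    return islands

islands = [[[1, 1], [0, 0]], [[0, 1], [0, 0]]]
migrate_ring(islands, sum)
```

In a full island model, a step like this would be called only occasionally, with each subpopulation otherwise evolving on its own between migrations.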

The motivation of explicit parallelism is largely based upon the tenets of the theory of punctuated
equilibria. The theory of punctuated equilibria, and its relation to GAs is described in [Cohoon et al., 1988]:

Punctuated Equilibria is based upon two principles: allopatric speciation and stasis. Allopatric
speciation involves the rapid evolution of new species after a small set of members of species,
peripheral isolates, becomes segregated into a new environment. Stasis, or stability, of a
species, is simply the notion of lack of change. It implies that after equilibria is reached in an
environment, there is very little drift away from the genetic composition of species. ... Punctuated
Equilibria stresses that a powerful method for generating new species is to thrust an old
species into a new environment, where change is beneficial and rewarded. For this reason we
should expect a genetic algorithm approach based upon punctuated equilibria to perform
better than the typical single environment scheme [Cohoon et al., 1988].

An example which clearly displays the benefits of using a parallel GA in place of a traditional GA is shown
below. The inability of a traditional genetic algorithm to maintain simultaneously multiple equally good
points in the function space prevents the GA from finding the global optima. As the parallel genetic
algorithm (pGA) model is able to maintain multiple good points, the pGA is able to solve the problems more
consistently. This problem is attempted with a single population GA with 100 members, and an "island"
model GA with 5 populations, each consisting of 20 members. The problem and results are shown below,
see Figure 2. This test problem was created jointly with Juels [Juels, 1993].

$$f(x) = \mathrm{MAX}(o(x), z(x)) + \mathrm{REWARD}$$

$$\mathrm{REWARD} = \begin{cases} \mathrm{LENGTH} & \text{if } (o(x) > T) \wedge (z(x) > T) \\ 0 & \text{otherwise} \end{cases}$$

LENGTH  is  the  number  of  bits  in  the  solution  string 

o(x)  is  the  number  of  contiguous  ones  starting  at  the  first  position 

z(x)  is  the  number  of  contiguous  zeros  ending  at  the  last  position 

T  is  the  threshold,  less  than  LENGTH/2 

MAX  returns  the  larger  of  its  two  arguments. 
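
The test function can be sketched directly from the definitions above. The helper names are hypothetical, and the strict comparison against the threshold is an assumption, since the comparison operators in the source equation are only partly legible.

```python
def leading_ones(bits):
    """o(x): number of contiguous ones starting at the first position."""
    count = 0
    for bit in bits:
        if bit != 1:
            break
        count += 1
    return count

def trailing_zeros(bits):
    """z(x): number of contiguous zeros ending at the last position."""
    count = 0
    for bit in reversed(bits):
        if bit != 0:
            break
        count += 1
    return count

def evaluate(bits, threshold):
    """f(x) = MAX(o(x), z(x)) + REWARD."""
    o, z = leading_ones(bits), trailing_zeros(bits)
    # Assumption: both counts must strictly exceed the threshold T.
    reward = len(bits) if (o > threshold and z > threshold) else 0
    return max(o, z) + reward

all_ones = evaluate([1] * 10, threshold=3)
all_zeros = evaluate([0] * 10, threshold=3)
mixed = evaluate([1] * 6 + [0] * 4, threshold=3)
```

Under these assumptions, the all-ones and all-zeros strings are equally good but suboptimal; only a string that keeps both a long run of leading ones and a long run of trailing zeros collects the reward.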

Figure  2:  The  failure  of  a  simple  GA  to  maintain  multiple  good  points  in  the  function  space. 

This  exemplifies  the  benefits  of  using  divided  subpopulations  for  explicitly  preserving 
diversity.  Average  of  20  runs  shown. 


Admittedly, this problem is contrived to show the potential benefits of using parallel populations. The
single large population converges to a suboptimal solution of either all ones or all zeros. With multiple
subpopulations, each subpopulation can evolve to a different point. The eventual interaction between
subpopulations can combine these points to find globally optimal solutions. In general, optimization
problems may not have the same characteristics necessary to reveal such pronounced benefits of pGAs in
comparison to traditional GAs. Nonetheless, pGAs have shown benefits in a large number of optimization
problems, both in terms of the number of evaluations performed and the final results obtained [Husbands,
1990] [Whitley & Starkweather, 1990] [Davidor et al., 1993] [Gordon & Whitley, 1993].

The previous paragraphs showed that a simple genetic algorithm, with a single population, cannot
maintain dissimilar points in the function space. The "Fundamental Theorem of Genetic Algorithms" does not
address this. This theorem is stated below:

"Short, low order, above average schemata receive exponentially increasing trials in
subsequent generations" [Goldberg, 1989]

The inability to provide increasing trials to multiple dissimilar points is akin to the problem of genetic drift
in biological systems. Due to stochastic sampling errors, a population will converge to a single point when
presented with multiple points in the function space which have the same, or only an extremely small
differential in their evaluations. In describing the effect of genetic drift, Goldberg and Richardson summarize
this finding:

"... the theorem [Fundamental Theorem of Genetic Algorithms [Goldberg, 1989]], assumes an
infinitely large population size. In a finite size population, even when there is no selective
advantage for either of two competing alternatives... the population will converge to one
alternative or the other in finite time (De Jong, 1975; [Goldberg & Segrest, 1987]). This problem of
finite populations is so important that geneticists have given it a special name, genetic drift.
Stochastic errors tend to accumulate, ultimately causing the population to converge to one
alternative or another" [Goldberg & Richardson, 1987].
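
The effect of genetic drift can be illustrated with a small simulation. The sketch below is an assumption-laden toy model, not an experiment from this study: a finite population holds two alternatives with no selective advantage, and each generation is produced by resampling from the current proportions. The function name and parameters are hypothetical.

```python
import random

def generations_until_fixation(pop_size, rng, max_generations=100_000):
    """Resample a two-alternative population until one alternative is lost."""
    ones = pop_size // 2  # start with both alternatives equally represented
    for generation in range(max_generations):
        if ones == 0 or ones == pop_size:
            return generation
        p = ones / pop_size
        # Each new member copies a randomly chosen member of the old population.
        ones = sum(1 for _ in range(pop_size) if rng.random() < p)
    return None

fixation_times = [generations_until_fixation(20, random.Random(seed)) for seed in range(5)]
```

Even though neither alternative is favoured, sampling error alone drives the population to a single alternative; the number of generations required varies from run to run.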

This experiment was included to demonstrate that the simple genetic algorithm cannot maintain implicit
parallelism in the latter portions of search. In systems which can maintain the parallelism in search, the
problem is easily solvable. In the next section, the population is replaced by a single probability vector,
which specifies the probability, in each position, of containing a specific value. A single vector can be used,
because the parallelism (the ability to maintain multiple dissimilar high evaluation points in the function
space) is not maintained in a single population GA. In modelling GAs with more than a single population,
multiple probability vectors can be used. The use of multiple probability vectors is not examined here;
however, it is returned to in the discussion of future directions.


2.2.  Replacing  the  Population

The PBIL algorithm attempts to create a probability vector from which samples can be drawn to produce the next generation's population. As in standard GAs, it is assumed that the solution is encoded into a fixed length vector. Ignoring the contribution of mutation, the expected distribution of values in each position of the population during generation G can be computed based upon the population of generation G-1. Assuming a fully generational GA (each member of the population is replaced in the subsequent generation), fitness proportional selection, and a general pair-wise recombination operator, the probability of value j appearing in position i in a solution vector x, in a population at generation G, can be computed as follows:


$$P(i,j) = P(x_i = j) = \frac{\sum_{v \in Population_{G-1} \,\wedge\, v_i = j} EvaluateVector(v)}{\sum_{v \in Population_{G-1}} EvaluateVector(v)}$$


This is simply a counting argument, weighted by the evaluation of each solution string. Given a population, a unique representation can be made by a probability matrix defined by the above equation. Two comments should be made about this representation. First, many populations will have identical probability matrices. Second, if the above representation were used to represent the population, and samples for generation G were drawn by sampling points based upon this distribution, the solution vectors produced would be unlikely to improve over those in generation G-1. This is due to the implicit assumption that each bit position's value is independent of all other bit positions' values across individual solution vectors. Traditional crossover does not assume this; a pair-wise crossover operator maintains more information between the bit positions, as the composition of the "children" solution strings is chosen from only two "parent" solution strings.
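
The counting argument above can be illustrated with a short sketch. It is not taken from the paper: the function name `probability_matrix`, the string encoding of solution vectors, and the toy evaluation function `count_ones` are all assumptions made for illustration. Each entry is the evaluation-weighted share of the previous population that carries value j in position i.

```python
from typing import Callable, Dict, List, Sequence


def probability_matrix(
    population: Sequence[str],
    evaluate: Callable[[str], float],
    alphabet: str = "01",
) -> List[Dict[str, float]]:
    """Return P(i, j) for every position i and value j.

    P(i, j) is the summed evaluation of the vectors holding value j in
    position i, divided by the summed evaluation of the whole population.
    """
    total = sum(evaluate(v) for v in population)
    if total <= 0:
        raise ValueError("evaluations must sum to a positive value")
    length = len(population[0])
    matrix = []
    for i in range(length):
        row = {}
        for j in alphabet:
            weight = sum(evaluate(v) for v in population if v[i] == j)
            row[j] = weight / total
        matrix.append(row)
    return matrix


def count_ones(v: str) -> float:
    # Hypothetical evaluation function, used only for this illustration.
    return float(v.count("1"))


population = ["1110", "1000", "0011", "0001"]
matrix = probability_matrix(population, count_ones)
# The evaluations are 3, 1, 2, 1 (total 7); the vectors with a '1' in
# position 0 carry weight 3 + 1, so matrix[0]["1"] is 4/7.
```

Because only per-position marginals are stored, any correlation between positions is lost, which is the independence assumption discussed above.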

Although a population represented directly as mentioned above may not be useful for sampling in order to generate the next generation's vectors, a variation of the above representation can be useful. From a population of size N, consider probabilistically (based upon fitness) selecting N solution vectors for recombination. These newly generated vectors can be represented as a probability matrix by simply counting the number of occurrences of each value in each bit position. This method is described in greater detail by [Syswerda, 1992]. This probability matrix can be used to create a new population. This representation has been used to simulate crossover. It is compared to other crossover operators in [Eshelman & Schaffer, 1993].

In a manner similar to the methods described above, the population-based incremental learning (PBIL) algorithm uses a probability vector to describe the population of a genetic algorithm. In a binary encoded solution string, the probability vector specifies the probability of each bit position containing a '1'. The probability of the bit position containing a '0' is obtained by subtracting the probability specified in the vector from 1.0. For example, see Figure 3. Although from this point PBIL will be considered for solution vectors encoded in a binary alphabet, this method can be applied to higher-cardinality representations.


Population #1           Population #2            Population #3

0011                    1010                     1010
1100                    1100                     0101
1100                    1100                     1010
0011                    1100                     0101

Representation          Representation           Representation

0.5, 0.5, 0.5, 0.5      1.0, 0.75, 0.25, 0.0     0.5, 0.5, 0.5, 0.5


Figure 3: The probability representation of 3 small populations of 4 bit solution vectors;
population size is 4. Notice that the first and third representation for the population are
the same, although the solution vectors each represents are entirely different.
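
The representations in Figure 3 can be checked with a few lines of code. This is an illustrative sketch only; the helper name `representation` and the use of bit strings are assumptions, and the counting here is unweighted, as in the figure.

```python
def representation(population):
    """Fraction of solution vectors holding a '1' in each bit position."""
    size = len(population)
    length = len(population[0])
    return [sum(v[i] == "1" for v in population) / size for i in range(length)]


population_1 = ["0011", "1100", "1100", "0011"]
population_2 = ["1010", "1100", "1100", "1100"]
population_3 = ["1010", "0101", "1010", "0101"]

rep_1 = representation(population_1)  # [0.5, 0.5, 0.5, 0.5]
rep_2 = representation(population_2)  # [1.0, 0.75, 0.25, 0.0]
rep_3 = representation(population_3)  # [0.5, 0.5, 0.5, 0.5]
```

Populations #1 and #3 share no solution vector, yet they map to the same representation.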

The PBIL algorithm attempts to create a probability vector which can be considered a prototype for high evaluation vectors for the function space being explored. A very basic observation of genetic algorithm behavior provides the fundamental guidelines for the performance of PBIL. One of the key features in the early portions of genetic optimization is the parallelism in the search; many diverse points are represented in the population of a single generation. In representing the population of a GA in terms of a probability vector, the most diversity will be found in setting the probabilities of each bit position to 0.5. This specifies that generating a 0 or 1 in each bit position is completely random.

The PBIL algorithm uses the probability vector representation for defining a population. Rather than passively transforming each population into a probability vector, from which solution vectors are generated and recombined, etc., the aim of PBIL is to actively create a probability vector which, with high probability, represents a population of high evaluation solution vectors. Unlike the mechanisms inherent to a GA, operations are not defined on the population; rather, in PBIL, operations take place directly on the probability vector. These mechanisms are derived from those used in competitive learning.

In a manner similar to the training of a competitive learning network, the values in the probability vector are gradually shifted towards representing those in high evaluation vectors. A simple procedure to accomplish this is described below. The probability update rule, which is based upon the update rule of competitive learning, is shown below.

$$probability_i = (probability_i \times (1.0 - LR)) + (LR \times vector_i)$$

probability_i is the probability of generating a 1 in bit position i.

vector_i is the ith position in the solution vector which the probability vector is moved towards.

LR is the learning rate.

The probability update rule described above is similar to the weight update rule in a competitive learning network when an output is moved towards a particular sample point. It is interesting to note that each bit is examined independently. Above, it was claimed that representing each bit independently disregarded much of the information which standard crossover preserved. The reason the assumption of independence is not detrimental is that PBIL does not attempt to represent the entire population by the probability vector. Rather, PBIL represents a single point, based upon the vector with the highest evaluation, around which the next population of points should be generated. This is described in more detail below.
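
As a small illustration of the update rule, the sketch below applies it once to a four-bit probability vector. The function name `update_probability_vector` and the learning rate of 0.1 are assumptions chosen for the example, not values prescribed by the text.

```python
def update_probability_vector(probability, vector, lr):
    """Move each probability toward the corresponding bit of `vector`.

    Implements probability_i = probability_i * (1.0 - LR) + LR * vector_i,
    applied independently to every bit position.
    """
    return [p * (1.0 - lr) + lr * bit for p, bit in zip(probability, vector)]


probability = [0.5, 0.5, 0.5, 0.5]
best = [1, 0, 1, 1]  # hypothetical highest-evaluation vector
updated = update_probability_vector(probability, best, lr=0.1)
# Positions where `best` holds a 1 rise to about 0.55; the position
# holding a 0 falls to about 0.45.
```

Each position moves a fraction LR of the remaining distance toward the target bit, so the vector never leaves the interval [0, 1].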

The step which remains to be defined is how to determine which solution vectors to move towards. If the good vectors were known a priori, the problem would, of course, already be solved. Several alternative methods have been explored. The basis of some of the earlier attempts involved generating a single vector, and deciding whether to push the probability vector towards the generated vector. However, a more effective method, which has proven empirically to be more resistant to getting caught in local minima, is to generate a number of vectors all based upon the probabilities specified in the probability vector, and to push the probability vector towards the generated vector with the highest evaluation. After the probability vector is updated, a new set of vectors is produced based upon the updated probability vector, and the cycle is continued.

Generating a population of potential solutions, rather than a single solution vector, is an attempt to maintain the ability to explore large regions of space in a parallelized manner, as the GA does in the early stages of search. In the early stages of search, there is a large amount of diversity in the regions of the function space which are simultaneously explored. As the search progresses in PBIL, the values in the probability vector move away from 0.5, towards either 0.0 or 1.0. In the GA, this corresponds to the respective bit positions in the majority of the solution strings having the same value. As the search progresses, the population of the GA tends to converge around a good solution vector in the function space. Analogously to genetic search, the progression of search in PBIL converges around a single point. The PBIL algorithm and a standard GA face the same problem of premature convergence. In PBIL, as the probabilities become very close to either 0.0 or 1.0, the similarity in the vectors generated increases. One advantage which the PBIL algorithm offers is explicit control of how fast the population converges. The setting of the learning rate parameter, which can greatly affect the speed of convergence, is explored in Section 3.

A concern is that because only a single probability vector is used, it may have less expressive power than a full population GA. A GA which uses a population of points can represent a large number of points simultaneously. For example, in Figure 3, population #1 and population #3 are both represented by a probability vector of 0.5 in every position; therefore, it is unlikely that the population members would be regenerated by sampling the probability vector. This appears to be a fundamental limitation of PBIL, as it is possible for a GA to contain either of these populations. However, for the reasons mentioned previously (genetic drift), in simulating genetic search, a traditional single population GA would not be able to maintain either of these populations. Because of sampling errors, the population will converge to one point; it will not be able to maintain multiple dissimilar points. Therefore, even if the members of the populations shown in Figure 3 had equally high evaluations, the GA would be unable to maintain them in its population, and would converge to only one. Similarly, in PBIL, the values of 0.5 will quickly be changed to favor either 0.0 or 1.0 through the search's progression; the probability vector can only represent one of the dissimilar points. Methods designed to address this problem, which implement explicit parallelism, are not explored here. However, it should be noted that many of the methods are applicable to both GAs and PBIL.

As the population which is represented by a probability vector is not unique, this aids in maintaining diversity in search. For example, members from population #1 and population #3 can be generated by the probability vector used to represent these populations. In traditional GAs, uniform crossover (see Appendix A) also often has the same characteristics from one generation to the next: extremely dissimilar children can be produced from the same parents.

As discussed previously, the GA does not rely only on crossover to perform extensive search. In the latter parts of search, mutation plays a prominent role. In an analogous manner, a similar operator is defined in PBIL. For PBIL, there are two ways of defining a mutation operator. The first is to perform the mutation directly on the vectors generated. The second method is to perform a mutation on the probability vector; this mutation can be defined as a small probability of perturbation on each of the positions in the probability vector. Both of these forms of mutation have the same effect as mutation in standard genetic algorithms: to help preserve diversity. For the experiments conducted in this study, the second form of mutation is implemented. The full algorithm is described in Figure 4; this is the original EGA algorithm as formulated in conjunction with [Juels, 1994]. The role of the crossover operator and the behavior of the population over multiple sequential crossover steps in the GA forms the basis of the EGA; the EGA is currently being analyzed from this perspective by [Juels, 1994].

To keep the analogy with a GA simple, the number of vectors generated at each iteration is defined to be the population size. The update rule to the probability vector is investigated from the perspective of competitive learning in the next section. The mutation operator and its benefits are explored more thoroughly in section 2.4.


2.3.  The  Probability  Vector  and  Competitive  Learning 

The probability vector maintained by PBIL can be viewed as a prototype vector for generating solution vectors which have high evaluations with respect to the available knowledge of the function space. In each generation, the probability vector is adjusted to represent the current highest evaluation vector. As values in the bit positions become more consistent between the highest evaluation vectors produced in subsequent generations, the probability of generating the value in the bit position increases. As mentioned before, the unsupervised context of traditional competitive learning is no longer appropriate. The feature of interest is competitive learning's ability to find a prototype for high evaluation vectors. However, it should also be noted that the supervisory signals used in this algorithm are also non-standard. In most traditional supervised classification tasks, the sample points are pre-labeled. In PBIL, the class of high evaluation vectors is defined during the algorithm's search. In each population of points generated, the highest evaluation vector produced is defined to be in the class of interest. Finally, it should also be noted that the probability vector not only specifies the prototype based upon the high evaluations of the sample solutions, but also guides the search, which produces the next sample point from which to "learn".

The classes of problems to be addressed by PBIL are unlike many other problems for which competitive learning is often employed. In many domains to which CL is applied, one of the largest difficulties in training the CL-net is the lack of available training data. However, there is an abundance of training data in the class of problems attempted here. Training data is available through the evaluation of potential solution vectors. Nonetheless, to be efficient, the algorithm must minimize the number of function evaluations performed. Therefore, as information regarding the characterization of high evaluation vectors becomes available, it is incorporated into the search; the updated probability vector is used to generate the next population of sample points.

Although the method of defining classes is not common, the supervised training of this algorithm is similar to the learning vector quantization (LVQ) method presented by Kohonen [Kohonen, 1988] [Kohonen, 1989]. Kohonen's LVQ attempts to find the prototype vector of a set of known classes. In LVQ, the method by which the output unit is chosen is the same as in standard CL. However, if the winner is a class to which the point does not belong, the winner's weights are adjusted to move the prototype vector away from the misclassified point. As in standard CL, if the class is correct for the input point, the winner's weights are adjusted to more reliably classify the point. Similar extensions are possible for PBIL; such extensions include learning from misclassified examples. This is explored in greater detail in section 4.

The PBIL algorithm also has ties with early reinforcement learning algorithms [Moore, 1994], which use a probability vector approach to probabilistically choose an action. Such learning algorithms include the linear reward-penalty algorithm and its specialization, the linear reward-inaction algorithm. A good survey of early reinforcement algorithms can be found in [Kaelbling, 1990].

P ← initialize probability vector. (Each po...)
loop # GENERATIONS

    # Generate Samples
    i ← loop # SAMPLES
        sample_i ← generate sample vector according to probabilities in P
        evaluation_i ← evaluate (sample_i)

    # Find Best Sample
    max ← find vector corresponding to maximum evaluation

    # Update Probability Vector
    i ← loop # LENGTH
        P_i ← P_i * (1.0 - LR) + max_i * (LR)

    # Mutate Probability Vector
    i ← loop # LENGTH
        if (random (0,1] < MUT_PROBABILITY)
            P_i ← P_i * (1.0 - MUT_SHIFT) + random (0.0 or 1.0) * (MUT_SHIFT)


USER DEFINED CONSTANTS:

GENERATIONS: number of iterations to allow learning
SAMPLES: the population size, number of samples to produce per generation
LENGTH: length of encoded solution
MUT_PROBABILITY: probability of mutation occurring in each position
MUT_SHIFT: amount for mutation to affect the probability vector
LR: learning rate

VARIABLES:

P: probability vector
sample_i (i = 1..SAMPLES): solution vectors
evaluation_i (i = 1..SAMPLES): evaluations of the solution vectors
max: solution vector corresponding to maximum evaluation

Figure 4: The basic PBIL algorithm. When using this algorithm, the highest evaluation vectors
should also be maintained, and returned at the end of the procedure. This is the EGA
algorithm as formulated with [Juels, 1994].
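
The procedure in Figure 4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the author's implementation: the function name `pbil`, its signature, the default parameter values, and the `seed` argument are all hypothetical, and the initial probability of 0.5 in every position follows the earlier remark that this setting gives the most diversity. The sketch assumes a maximization problem over bit lists and keeps the best vector seen, as the caption recommends.

```python
import random


def pbil(evaluate, length, generations=100, samples=20, lr=0.1,
         mut_probability=0.02, mut_shift=0.05, seed=None):
    """Run a basic PBIL loop; return (best_vector, best_score, P)."""
    rng = random.Random(seed)
    p = [0.5] * length  # assumed initialization: maximum diversity
    best_vector, best_score = None, float("-inf")

    for _ in range(generations):
        # Generate samples according to the probabilities in P.
        population = [[1 if rng.random() < p_i else 0 for p_i in p]
                      for _ in range(samples)]
        scores = [evaluate(s) for s in population]

        # Find the best sample of this generation.
        winner_index = max(range(samples), key=scores.__getitem__)
        winner = population[winner_index]
        if scores[winner_index] > best_score:
            best_vector, best_score = winner, scores[winner_index]

        # Update the probability vector toward the winner.
        p = [p_i * (1.0 - lr) + bit * lr for p_i, bit in zip(p, winner)]

        # Mutate the probability vector: with a small probability, shift
        # a position toward a randomly chosen extreme (0.0 or 1.0).
        p = [p_i * (1.0 - mut_shift) + rng.choice((0.0, 1.0)) * mut_shift
             if rng.random() < mut_probability else p_i
             for p_i in p]

    return best_vector, best_score, p


# Usage: maximize the number of ones in a 20-bit vector (a toy problem
# chosen for illustration; `sum` serves as the evaluation function).
best_vector, best_score, final_p = pbil(sum, length=20, seed=0)
```

Each generation costs exactly `samples` function evaluations, which keeps the comparison with a GA of the same population size straightforward.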

2.4.  The  Role  of  Mutation  in  GAs  and  PBIL 

One of the goals of this paper is to examine the roles of the operators used in genetic algorithms to determine what their relevance is to the PBIL algorithm, and what their role should be. In this section, the role of the mutation operator is examined. The role of mutation is more pronounced in the latter stages of search, when diversity in the population is lost [Spears, 1992]. Crossover is valuable in the early portion of search as it takes large steps towards better solutions [Fogel, 1993]. In many optimization problems, much of the fine-tuning (moving from good regions of the function space to optimal solutions) in genetic search occurs because of the mutation operator. The crossover operator is less effective, as described previously, as recombining similar chromosomes exclusively through the use of crossover does not introduce diversity or induce exploration of the function space. A good discussion of the trade-offs between mutation and crossover, and their roles in performing extensive search, can be found in [Spears, 1992] [Fogel, 1993].

This section is provided to show the importance of the mutation operator in genetic search and in PBIL. The role of mutation is examined in an easy sample problem, described below. The performance of a GA with and without mutation and PBIL with and without mutation is shown for the sample problem in Figure 5. Also included in Figure 5 is the performance of a GA which has mutation only in the later parts of the search; in particular, the mutation is turned on only after generation 200. These are included to provide an intuitive understanding of the role mutations have in function space exploration. The performance of a next-step hillclimber is given; the details of the hillclimber can be found in Appendix B. For the problem attempted, a solution vector is composed of 200 bits, and the population size for each algorithm is 100 vectors. The mutation rate for the genetic algorithm is 0.001. The mutations for PBIL occur on the probability vector with probability 0.02. In each mutation, the probability vector is shifted by a value of 0.05 in a randomly chosen direction. The graph shown represents the average performance of 20 runs, allowed to perform 50,000 function evaluations each (500 generations). Each of the 20 variables x_0 to x_19 is encoded in base 2, with 10 bits.

Figure 5: The effects of mutation on the search ability of a standard genetic algorithm and PBIL.
The left graph shows GA runs with mutation and crossover operators on before and
after generation 200 (see diagram). In the GA runs in the right graph, the crossover
operator is constantly on in both runs, and the mutation operator is on only in the first
run. The highest evaluation the hillclimber (Appendix B) was able to achieve was
2969, after 50,000 x 20 evaluations (758 restarts).


Although this is not a hard function to optimize, part of the difficulty in finding optimal solutions for this problem is the simple binary encoding used. The difficulty exists because of "Hamming cliffs" present in the encoding. These are points at which the representation of the next sequential number is largely dissimilar to the current number's representation. For example, consider the binary representation for 255 and 256: 011111111 and 100000000. This type of encoding increases the difficulty of search. The problem can be overcome through the use of Gray coding. Gray coding ensures that the representations of subsequent numbers differ only by a single bit. Gray code is not used for this problem, in order to display one of the many potential difficulties in creating arbitrary black-box function optimization techniques.
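
The Hamming cliff between 255 and 256 can be made concrete with a short sketch. The text does not say which Gray code is meant; the example assumes the common binary-reflected Gray code, and the helper names `to_gray` and `hamming_distance` are illustrative.

```python
def to_gray(n: int) -> int:
    """Binary-reflected Gray code of a non-negative integer (assumed variant)."""
    return n ^ (n >> 1)


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


binary_255 = format(255, "09b")        # '011111111'
binary_256 = format(256, "09b")        # '100000000'
gray_255 = format(to_gray(255), "09b")  # '010000000'
gray_256 = format(to_gray(256), "09b")  # '110000000'

binary_distance = hamming_distance(255, 256)                  # 9 bits differ
gray_distance = hamming_distance(to_gray(255), to_gray(256))  # 1 bit differs
```

Under the plain binary encoding, moving from 255 to 256 requires flipping all nine bits at once, while under the Gray encoding a single bit flip suffices.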

Although it is clear from the above example that mutation can aid PBIL in finding better final solutions, mutation may not play as crucial a role in PBIL as it does in standard genetic search. The manner in which potential solutions are probabilistically created from the prototype vector allows sampling a diverse set of points. The role of mutation is to prevent the prototype vector from too quickly converging to an extreme value (either 0.0 or 1.0) in each of the bit positions.


3.  Examining  the  Effects  of  Changing  the  Learning  Rate 

There are four parameters which can be adjusted in the PBIL algorithm described to this point: the population size, the learning rate, the mutation probability and the mutation shift (the magnitude of the effect of a mutation on the probability vector). In consideration of the length of the paper, this section will concentrate only on the setting of the learning rate parameter, as this parameter has no counterpart in traditional GAs. Although the settings of population size and the mutation probability and shift can have a significant impact on the effectiveness of PBIL and a GA, there is an abundance of literature describing the tuning of the analogous parameters in standard genetic algorithms. Although extensive testing has not been conducted with different population sizes and different mutation rates with PBIL, it is suspected that much of this literature should yield valuable insights into appropriate setting of the parameters in PBIL. Some references for the setting of population size are [Muhlenbien, 1990] [Goldberg, 1989] [Goldberg, 1992]. References in which settings of mutation rates are discussed include [De Jong, 1975], [Baeck, 1993], [Whitley and Starkweather, 1990].

The learning rate parameter has a larger effect in PBIL than in standard competitive learning. As with competitive learning, the learning rate affects how fast the prototype vector is shifted to resemble a correctly classified point. Since in PBIL the probability vector is used to generate the next set of sample points, the learning rate also affects which portions of the function space will be explored. The setting of the learning rate has a direct impact on the trade-off between exploration of the function space and exploitation of the exploration already conducted. In this context, exploration is the ability of the algorithm to search the function space thoroughly, while exploitation refers to the algorithm's ability to use the information it has gained about the function space to narrow its future search. For example, if the learning rate is 0, there is no exploitation of the information gained through search. As the learning rate is increased, the amount of exploitation increases, and the ability to search large portions of the function space diminishes. This is illustrated in the following example.

Eshelman and Schaffer used the following problem to compare how much exploration is done by various crossover operators [Eshelman and Schaffer, 1993]. In a 100 bit solution vector x, the evaluation of the vector is defined as follows:

$$f(x) = MAX(o(x), z(x))$$

where:

o(x) is the number of ones occurring in x,
z(x) is the number of zeros occurring in x.
MAX returns the larger of its two arguments.

This is similar to the first problem described in the paper, although this does not impose the requirements of building consecutive ones from the left, and consecutive zeros from the right. Eshelman and Schaffer attempted to determine how soon the commitment to either all zeros or all ones was made. In other words, if the final solution string consisted of all zeros, what was the largest number of ones ever produced in a solution string at any time during the run, and vice-versa.
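
For concreteness, the evaluation function above can be written in a few lines. The function name `evaluate` and the representation of x as a list of 0/1 integers are assumptions made for this sketch.

```python
def evaluate(x):
    """f(x) = MAX(o(x), z(x)) for a solution vector x given as a list of bits."""
    ones = sum(x)            # o(x): number of ones
    zeros = len(x) - ones    # z(x): number of zeros
    return max(ones, zeros)


all_ones = evaluate([1] * 100)              # 100
all_zeros = evaluate([0] * 100)             # 100
balanced = evaluate([1, 0] * 50)            # 50
mostly_ones = evaluate([1] * 60 + [0] * 40)  # 60
```

The function has two equally good optima (all ones and all zeros), and a perfectly balanced 100 bit vector receives the lowest possible evaluation of 50.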

The higher the learning rate parameter is set, the faster the algorithm will focus search. The lower the learning rate, the more exploration will occur. In Figure 6a, four learning rates are explored which show rapid convergence. In Figure 6b, the learning rates are greatly lowered, and the rate of convergence is significantly slower. In Figure 6c, the convergence of the population for a simple genetic algorithm is shown. The GA run is shown both with and without elitist selection. For a more complete description of elitist selection, see Appendix C.

From the figures, it is clear that for some values of the learning rate in PBIL, a simple GA exhibits a greater ability to perform extensive exploration. For these settings, it is possible to design problems in which the GA can be ensured to perform better than PBIL on average. The types of problems may be as simple as a time-varying function which gives a reward for finding 40 ones after performing some large number of evaluations (in this case, 10000 would make the GA (w/o elitist selection) more successful than the PBIL algorithms shown in Figure 6a). However, lowering the learning rate in PBIL can address some of these problems. It should be noted, however, that if the learning rate is set too low, it is possible to cripple the algorithm so that it may take an enormous number of samples before it is able to shift the probability vector enough to escape the initial oscillations, as is shown in Figure 6b.

The implications of this experiment extend to optimization problems other than the one shown. If the learning rate is high, the initial populations generated will largely determine the focus of the search, without enabling the algorithm to explore the function space. If the function space does not contain local optima, a high learning rate may work well. However, if local minima could be a problem, lower learning rates allow greater exploration.
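
A rough feel for how strongly the learning rate controls convergence can be obtained from the update rule alone. The sketch below assumes an idealized case that is not part of the experiments: the best vector holds a 1 in a given position in every generation and no mutation occurs, so after k updates the probability is 1 - (1 - p0)(1 - LR)^k. The function names, the starting value of 0.5, the target of 0.95, and the learning rates tried are all illustrative assumptions.

```python
def probability_after(k, lr, p0=0.5):
    """Probability of a 1 after k consecutive updates toward a 1 (closed form)."""
    return 1.0 - (1.0 - p0) * (1.0 - lr) ** k


def updates_to_reach(target, lr, p0=0.5):
    """Smallest number of consecutive updates toward 1 that reaches `target`."""
    if not 0.0 < lr <= 1.0:
        raise ValueError("lr must be in (0, 1]")
    p, k = p0, 0
    while p < target:
        p = p * (1.0 - lr) + lr  # update rule with vector_i = 1
        k += 1
    return k


fast = updates_to_reach(0.95, lr=0.1)   # 22 generations
slow = updates_to_reach(0.95, lr=0.01)  # 230 generations
```

Under these assumptions a tenfold reduction in the learning rate delays near-commitment of a bit position by roughly a factor of ten, which matches the qualitative trade-off between exploration and exploitation described above.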


4.  Extensions  to  the  Basic  Algorithm 

As noted earlier, some of the mechanisms used in PBIL are similar to those used in Kohonen's version of vector quantization. The greatest difference between Kohonen's LVQ and PBIL is the target outcome of each method. Kohonen's LVQ attempts to create decision boundaries for classes based upon known labeled examples. PBIL attempts to optimize a function based upon labeled examples. Regardless of these differences, the procedures used are very similar. In Kohonen's version of vector quantization, the learning is supervised, as the class for each of the sample points is known a priori [Kohonen, 1988]. Similarly, the class of interest is already defined in the PBIL procedure. In Kohonen's LVQ, the network is trained with both positive and negative examples. In particular, if the classification made by the network is correct, the weights of the winning output are strengthened; if the classification is incorrect, the weights are weakened. In an analogous manner, it is possible to modify PBIL to train the prototype vector with more than the single best solution vector. The remainder of this section is devoted to discussing two methods of using more than the single best generated vector.

Figure 6: The effects of learning rate on PBIL's ability to search the function space. Each line has
two parts, one moving towards the top of the graph, the other to the bottom. The line
moving towards the top of the graph represents the maximum evaluation in the
generation. If the best solution in the generation is found because of the number of 1s,
then the line extending toward the lower regions represents the maximum number of
0s in each generation. If the best solution is found because of the number of 0s, then the
lower line represents the maximum number of 1s. Average of 20 runs shown.

In each iteration of PBIL, a number of vectors are generated. However, in the version of the algorithm
described to this point, the prototype vector is only adjusted based upon the single best solution vector
generated in the current generation. When large populations are used, this has the potential of ignoring a
large amount of the work and exploration performed by the algorithm. Several alternatives are possible.
The first is to move the probability vector in the direction of the best M vectors, where M « N (N is the
number of vectors generated). This can be realized through several approaches. The first is the
straightforward implementation: the prototype vector is moved equally in the direction of each of the selected
vectors. An alternative method is to move the probability vector only in the positions in which there is a
consensus in all, or most, of the M solution vectors. For the positions in which there is no consensus in the
M generated vectors, the prototype vector is not shifted. Another rule could base the move of the
prototype vector on the relative evaluations of the best M vectors. However, to this point, the algorithm can
ignore issues regarding scaling, as only the rank is important. As described earlier in the paper, problems
regarding scaling have often proved to be detrimental in GA search. Although methods which require
such scaling may prove to be useful, this tactic is not explored in this paper.
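
The two rank-based rules for using the best M vectors can be sketched as follows. This is an illustrative
sketch, not the implementation used in the study: the function names, the choice to realize the "equal" move
as a step toward the mean of the M vectors, and the `agreement` threshold are all assumptions.

```python
def update_equal(p, best_m, lr):
    """Move p toward the mean of the M selected vectors, so each vector has equal influence."""
    m = len(best_m)
    mean = [sum(v[i] for v in best_m) / m for i in range(len(p))]
    return [(1.0 - lr) * pi + lr * mi for pi, mi in zip(p, mean)]


def update_consensus(p, best_m, lr, agreement=1.0):
    """Shift only the positions where at least `agreement` of the M vectors share a bit value.

    agreement=1.0 requires all M vectors to agree; lower values stand in for "most".
    Positions without consensus are left unchanged.
    """
    m = len(best_m)
    out = []
    for i, pi in enumerate(p):
        ones = sum(v[i] for v in best_m)
        if ones >= agreement * m:            # consensus on bit value 1
            out.append((1.0 - lr) * pi + lr)
        elif (m - ones) >= agreement * m:    # consensus on bit value 0
            out.append((1.0 - lr) * pi)
        else:                                # no consensus: do not shift
            out.append(pi)
    return out
```

Neither rule looks at the magnitude of the evaluations, only at which vectors ranked in the top M, so both
keep the rank-only property noted above.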

An alternative to only using multiple high evaluation vectors is to also use low evaluation vectors. In this
implementation, the single highest evaluation vector has the same role as in the original implementation.
However, the prototype vector is also moved away from the lowest evaluation vector. A naive
implementation of this is to simply move towards the complement of the lowest evaluation vector. However, this
method will not work well in the advanced stages of search, since the best and worst vectors may be
similar. As the prototype vector becomes more fixed towards either 0.0 or 1.0 for each bit position, the
probability of generating very similar vectors increases; therefore, the difference between the best and worst
generated vectors will diminish as the search progresses (of course, this does not imply that their
evaluations will not be vastly different). If the bit-wise difference is small, moving away from the worst vector
could be counter-productive, as it will also move away from the best vector in the majority of the bit
positions. A simple modification, which has proven to work well, is to only move away from the bits in the
worst vector which are different than those in the best vector.
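
The difference between the naive rule and the modified rule can be sketched as below. The sketch assumes the
basic PBIL update form (a convex step of size `lr` toward a target bit) and applies the negative step with
the same form and its own rate `neg_lr`; the function names and argument order are illustrative, not taken
from the study's code.

```python
def update_naive(p, best, worst, lr, neg_lr):
    """Move toward best, then toward the complement of worst at every position."""
    p = [(1.0 - lr) * pi + lr * b for pi, b in zip(p, best)]
    return [(1.0 - neg_lr) * pi + neg_lr * (1 - w) for pi, w in zip(p, worst)]


def update_modified(p, best, worst, lr, neg_lr):
    """Move toward best, then away from worst only where best and worst differ."""
    p = [(1.0 - lr) * pi + lr * b for pi, b in zip(p, best)]
    return [
        (1.0 - neg_lr) * pi + neg_lr * (1 - w) if b != w else pi
        for pi, b, w in zip(p, best, worst)
    ]
```

For example, with p = [0.9, 0.9, 0.1], best = [1, 1, 0], worst = [1, 0, 0], lr = 0.1 and neg_lr = 0.075, the
naive rule pulls the first position back from 0.91 to about 0.842 even though the best vector has a 1 there,
while the modified rule leaves it at 0.91 and only reinforces the second position, where the two vectors
disagree.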

One of the drawbacks with introducing either of the above two extensions to PBIL is that both introduce
more parameters to the algorithm. For example, the first method, moving towards multiple good vectors,
requires the number of high evaluation vectors considered, M, to be defined. The second method, moving
away from bad vectors, requires the specification of the learning rate for the negative examples. In the next
section, standard PBIL and PBIL with moves away from negative examples are examined empirically.
Several settings for the negative learning rate are investigated.


5.  Empirical  Analysis 

In this section standard PBIL, PBIL which includes learning from negative examples, and a simple GA are
compared empirically. All PBIL algorithms have the following parameters (unless otherwise noted):
learning rate = 0.1, mutation probability = 0.02, mutation shift = 0.05, and population size = 100. The algorithms
compared are:

1.  Standard PBIL - Negative Learning Rate = 0.0 (S-PBIL). This is the EGA algorithm as
developed in conjunction with [Juels, 1994].

2.  PBIL - Negative Learning Rate = 0.025 (PBIL2 0.025)

3.  PBIL - Negative Learning Rate = 0.075 (PBIL2 0.075)

4.  PBIL - Negative Learning Rate = 0.1 (PBIL2 0.1)

5.  PBIL - Naive implementation of learning from negative examples. Negative Learning
Rate = 0.025 (PBIL-N). For details, see Section 4.

6.  PBIL which only moves away from low evaluation vectors and does not move towards the best.
Learning Rate = 0.0; Negative Learning Rate = 0.1 (PBIL-LOW)

7.  Standard Genetic Algorithm; Population Size = 100; Mutation rate = 0.001; Crossover
= 2pt; Generational Model with Elitist Selection (SGA). See Appendix C for a more
detailed explanation.

8.  Multiple Restart, Next-Step Hill Climber. See Appendix B for a detailed explanation of
the algorithm. This algorithm was allowed 200,000 evaluations per run, to match the
number of evaluations per run in the other algorithms. The evaluations presented
here are the highest evaluations per run, averaged over 20 runs. Note that in all of these
problems, the hillclimber restarted itself in random positions a minimum of 81 times
(Bin Pack 1), and a maximum of 2397 times (De Jong's F2).

Several different values of the negative learning rate are included, as there is little or no literature
devoted to its setting. Although it is not expected that PBIL-N or PBIL-LOW will be able to optimize the
harder functions as well, they are included for a base-line reference.
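
To show how the parameters listed above fit together, the following is a minimal single-run sketch of PBIL
with negative learning. It is a sketch under stated assumptions, not the code used in the study: the
mutation step (each position of the probability vector is shifted, with probability `mut_prob`, by
`mut_shift` toward a random bit) is an assumed reading of "mutation probability" and "mutation shift", and
the function name, the default of 50 generations, and the seed handling are illustrative only.

```python
import random


def pbil(evaluate, n_bits, generations=50, pop_size=100, lr=0.1,
         neg_lr=0.075, mut_prob=0.02, mut_shift=0.05, seed=0):
    """Run PBIL on a maximization problem over bit lists; returns (best, score, p)."""
    rng = random.Random(seed)
    p = [0.5] * n_bits                                   # unbiased starting vector
    best_seen, best_score = None, float("-inf")
    for _ in range(generations):
        # Sample a population from the probability vector.
        pop = [[1 if rng.random() < pi else 0 for pi in p] for _ in range(pop_size)]
        ranked = sorted(pop, key=evaluate)               # only rank matters
        worst, best = ranked[0], ranked[-1]
        score = evaluate(best)
        if score > best_score:
            best_seen, best_score = best, score
        for i in range(n_bits):
            p[i] = (1.0 - lr) * p[i] + lr * best[i]      # positive learning
            if best[i] != worst[i]:                      # negative learning
                p[i] = (1.0 - neg_lr) * p[i] + neg_lr * best[i]
            if rng.random() < mut_prob:                  # assumed mutation form
                p[i] = (1.0 - mut_shift) * p[i] + mut_shift * rng.choice((0, 1))
    return best_seen, best_score, p
```

Setting `neg_lr=0.0` corresponds to S-PBIL in the list above. A toy usage, counting the 1 bits in a 20-bit
string, would be `pbil(sum, 20)`.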

The twelve problems on which each of the algorithms are compared can be divided into five groups. The
first three are known NP-complete problems: Jobshop Scheduling, Traveling Salesman Problems, and Bin
Packing problems. The fourth set of problems is often used in genetic algorithm literature to gauge the
performance of genetic algorithms; they are numerical optimization problems. The fifth category of problems
can best be described as those which are designed to trap GAs in local optima or plateaus. Two problems
are explored in this category: the first is the order-4 deceptive problem, a problem which is extremely
difficult for many single population GAs to optimize; the second is a checkerboard problem, as suggested by
[Boyan, 1993]. The remainder of this section describes the problems, their encoding, and the performance
of the seven algorithms on each of the problems.

In this section, the performance of each algorithm is shown in a graph representing the best evaluation in
each generation, averaged over 20 runs. With the description of each of these experiments, the
performance of a multi-start next-step hillclimber, described in detail in Appendix B, is also provided.


5.1.  Jobshop  Scheduling  Problems 

The job-shop scheduling problem is an important problem which is among the worst members of the
NP-complete problems [Garey & Johnson, 1979]. Recently, genetic algorithms have been applied to the
problem, as it is difficult for conventional search based methods to find near-optimal solutions in a reasonable
amount of time [Fang et al., 1993].

The following description of the job shop problem is given in [Fang et al., 1993]:

"In the general job shop problem, there are j jobs and m machines; each job comprises a set of
tasks which must each be done on a different machine for different specified processing times,
in a given job-dependent order. ... A legal schedule is a schedule of job sequences on each
machine such that each job's task order is preserved, a machine is not processing two different
jobs at once, and different tasks of the same job are not simultaneously being processed on
different machines. The problem is to minimize the total elapsed time between the beginning of
the first task and the completion of the last task (the makespan)." [Fang et al., 1993]

The problem is encoded in the following manner: a string of j * m integers is encoded into a binary bit
string, X. The integers are in the range [1...j]; therefore, there are log2 j * j * m bits in the encoding. A
circular list, C, of jobs is maintained. C is initialized to 1...j. The following procedure is repeated until C is
empty:

1.  read the next log2 j bits from X

2.  i ← convert the bits to an integer

3.  job y ← find the ith entry in list C

4.  task t ← find the next task to be scheduled for job y (this specifies the machine
needed)

5.  find the first viable time slot for task t, subject to the constraints that it must be

•  after the last scheduled task for job y

•  in a space which is at least as long as is required for task t

6.  increment the last scheduled task counter for job y. If job y is done, remove it from C

The encoding here used a bit string to encode the integers in the range 1...j. Fang et al. used chunks
which are atomic for the GA [Fang et al., 1993]. Although the encoding used here is certainly not the
easiest space in which to find solutions, it is used to provide results which are comparable to other algorithms.
As the makespan is to be minimized, the evaluation of the potential solution is (1.0/makespan).
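
The decoding procedure above can be sketched as follows. Several details are assumptions made only for
illustration: chunks are ceil(log2 j) bits wide, jobs are numbered from 0, the decoded integer indexes the
circular list modulo its current length, and each job is given as a list of (machine, duration) pairs in its
required order. The function name and data layout are hypothetical.

```python
import math


def decode_makespan(bits, jobs):
    """Build a schedule from the bit string and return its makespan.

    jobs[y] is the ordered list of (machine, duration) tasks for job y.
    """
    j = len(jobs)
    width = max(1, math.ceil(math.log2(j)))
    circ = list(range(j))        # circular list C of unfinished jobs
    next_task = [0] * j          # last-scheduled-task counter per job
    job_ready = [0] * j          # finish time of each job's last scheduled task
    busy = {}                    # machine -> sorted list of (start, end) intervals
    pos = 0
    while circ:
        chunk = bits[pos:pos + width]                  # step 1
        pos += width
        i = int("".join(map(str, chunk)), 2)           # step 2
        y = circ[i % len(circ)]                        # step 3 (wraps around C)
        machine, dur = jobs[y][next_task[y]]           # step 4
        start = job_ready[y]                           # step 5: earliest legal start
        intervals = busy.setdefault(machine, [])
        for s, e in intervals:
            if start + dur <= s:                       # task fits in the gap before (s, e)
                break
            start = max(start, e)                      # otherwise try after this interval
        intervals.append((start, start + dur))
        intervals.sort()
        job_ready[y] = start + dur
        next_task[y] += 1                              # step 6
        if next_task[y] == len(jobs[y]):
            circ.remove(y)
    return max(job_ready)


def evaluate_schedule(bits, jobs):
    """Evaluation used for maximization: the reciprocal of the makespan."""
    return 1.0 / decode_makespan(bits, jobs)
```

As a small usage example with two jobs and two machines,
`decode_makespan([0, 1, 0, 0], [[(0, 3), (1, 2)], [(0, 2), (1, 4)]])` schedules job 0 first on machine 0 and
yields a makespan of 9, while the bit string `[1, 0, 1, 0]` yields 8.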

Two standard test problems are attempted, a 10-job, 10-machine problem and a 20-job, 5-machine problem.
A description of the problems can be found in [Muth & Thompson, 1963]. The results are shown in Figure
7 and Figure 8.

The hillclimbing algorithm was unable to perform as well as the other algorithms on this problem, as it
achieved evaluations of 0.00097 and 0.00077 on the 10x10 and 20x5 jobshop scheduling problems,
respectively.


Figure 7: Job-Shop, 10x10


5.2.  Traveling Salesman Problems

The Traveling Salesman Problem (TSP) is perhaps the most famous of the NP-complete problems. A
version of the TSP is examined here in which the distances between cities are symmetric, and each of the cities
lies on a two dimensional plane. However, it should be noted that these assumptions are not used
explicitly by the algorithms tested. The object of the problem is to find the shortest length tour which visits each
city exactly once, and returns to the original city.

For a problem with N cities, the encoding requires a bit string of size N log2 N bits. A city is assigned a
contiguous substring of length M (M = log2 N). Each substring is interpreted as a binary integer in the range of
[0, N-1]. The city with the lowest integer value comes first in the tour, the city with the second lowest
comes second, etc. In the case of ties, the city whose substring comes first in the bit string comes first in the
tour.
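
The tour decoding described above amounts to a stable sort of the cities by the integer value of their
substrings. A minimal sketch follows; the function names are illustrative, the substring width is assumed to
be ceil(log2 N), and city coordinates are assumed to be given as (x, y) pairs.

```python
import math


def decode_tour(bits, n_cities):
    """Return the visiting order of the cities encoded in the bit string."""
    width = max(1, math.ceil(math.log2(n_cities)))
    keys = [
        int("".join(map(str, bits[c * width:(c + 1) * width])), 2)
        for c in range(n_cities)
    ]
    # sorted() is stable, so tied cities keep the order of their substrings.
    return sorted(range(n_cities), key=lambda c: keys[c])


def evaluate_tour(tour, coords):
    """Evaluation used for maximization: the reciprocal of the closed tour length."""
    length = sum(
        math.dist(coords[a], coords[b]) for a, b in zip(tour, tour[1:] + tour[:1])
    )
    return 1.0 / length
```

For instance, with four cities whose substrings decode to 2, 0, 2 and 1, `decode_tour` returns the order
[1, 3, 0, 2]: cities 0 and 2 tie, and city 0 comes first because its substring appears first.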

This encoding, although not very effective, is commonly used when encoding the tour in a binary string.
Again, it is used here to produce results which can be compared to other results given in the literature. As the
length of the tour is to be minimized, the evaluation of the potential solution is 1/Tour_Length. Two small
problems were attempted: the standard 30 city TSP, and the 50 city TSP [Whitley et al., 1989]. The results
are shown in Figure 9 and Figure 10.

The hillclimbing algorithm was unable to perform as well as the other algorithms on this problem, as it
achieved evaluations of 0.00194 and 0.00155 on the 30 and 50 city TSP problems, respectively.

Figure 10: TSP, 50 City.


5.3.  Bin Packing

In this problem, there are N bins of varying capacities and M elements of varying sizes, between 0.0 and
1.0. The problem is to pack the bins with elements as tightly as possible, without exceeding the maximum
capacity of any bin. This problem is also known to be NP-complete [Garey and Johnson, 1979].

In the problem attempted here, the error of a particular solution is measured by:

$$\mathrm{ERROR} = \sum_{i=1}^{N} \left| \mathrm{CAP}_i - \mathrm{ASSIGNED}_i \right|$$

where:

CAP_i is the capacity of bin i

ASSIGNED_i is the total size of the elements assigned to bin i.


The problem's solution is encoded in a bit string of length M * log2 N. Each element to be packed is
assigned a sequential substring of length log2 N. The substring is converted from binary to decimal. The
value revealed is the bin in which the element is placed. As the error in packing, ERROR, is to be
minimized, the evaluation of the potential solution is 1.0 / ERROR.
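
The encoding and error measure above can be sketched as a single evaluation function. The function name is
illustrative, the number of bins is assumed to be a power of two (as in the problems attempted here), and
returning infinity for a zero error is an assumption added only to avoid a division by zero.

```python
import math


def bin_packing_evaluation(bits, capacities, sizes):
    """Decode element-to-bin assignments and return 1.0 / ERROR."""
    n_bins = len(capacities)
    width = max(1, math.ceil(math.log2(n_bins)))
    assigned = [0.0] * n_bins
    for k, size in enumerate(sizes):
        chunk = bits[k * width:(k + 1) * width]      # substring for element k
        b = int("".join(map(str, chunk)), 2)         # binary -> bin index
        assigned[b] += size
    # ERROR = sum over bins of |CAP_i - ASSIGNED_i|
    error = sum(abs(c - a) for c, a in zip(capacities, assigned))
    return 1.0 / error if error > 0 else float("inf")
```

For example, with two bins of capacity 1.0 and elements of sizes 0.5, 0.5 and 0.75, the bit string [0, 0, 1]
fills the first bin exactly and leaves 0.25 unused in the second, giving ERROR = 0.25 and an evaluation of
4.0.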

Three problems are attempted, each with a string length of 512 bits. The problems attempted are of the
following sizes: 16 bins, 128 elements (Figure 11); 4 bins, 256 elements (Figure 12); 2 bins, 512 elements
(Figure 13).

It is interesting to note that on the second bin packing problem, the hillclimbing procedure was
consistently able to outperform all of the other algorithms (average evaluation was 168,078). On the first bin
packing problem, the hillclimbing algorithm did not perform well (average evaluation 26). On the third
bin packing problem, it was able to perform well, but was not the best overall (average evaluation 28,677,215).


Figure 11: Bin Packing (16 Bins, 128 Elements)

Figure 12: Bin Packing (4 Bins, 256 Elements)

Figure 13: Bin Packing (2 Bins, 512 Elements)


5.4.  Standard  Numerical  Optimization  Problems 

In addition to NP-complete problems, PBIL's ability to optimize functions which are common in genetic
algorithm literature is also examined. The first two functions examined, De Jong's F2 and F3 [De Jong,
1975], are easily optimized functions on which both PBIL and the standard genetic algorithm do very well.
These functions are rather small, and are easy to optimize for all of the methods examined here. Davis has
explored the value of using these functions as standard benchmarks in [Davis, 1991]. The third function is
based upon a function found in the paper by Gordon & Whitley, which is in turn based on a function by
Griewangk [Gordon & Whitley, 1993]. The encodings of the functions are slightly modified from their
original encodings; the problems and the modifications are described below.

Each function attempted here used standard binary encoding for all of the variables. Each variable was
encoded as a contiguous string in the solution vector. The problems are optimized as maximization
problems.

5.4.1.  De  Jong  F2 

The only difference between the encoding of De Jong's F2 used here and its original formulation is that
it was originally formulated as a minimization problem, with the absolute minimum at 0.0. In this
formulation, the problem is a maximization problem. The evaluation of the original F2 is subtracted from the
maximum of the function, 3905.93. The function is shown below. The results are shown in Figure 14.

All of the algorithms performed well on this problem. The hillclimbing algorithm's average evaluation
was 3905.93 (the optimal).


$$f(x) = 3905.93 - 100\,(x_1^2 - x_2)^2 - (1 - x_1)^2$$
$$-2.048 \le x_1, x_2 \le 2.048$$
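
The maximization form of F2 described above can be checked numerically with a short sketch. It assumes the
usual two-variable form of De Jong's F2, 100(x1^2 - x2)^2 + (1 - x1)^2, subtracted from 3905.93; the
function name is illustrative and the binary decoding of x1 and x2 is omitted.

```python
def f2_max(x1, x2):
    """De Jong's F2 recast as a maximization problem on -2.048 <= x1, x2 <= 2.048."""
    original = 100.0 * (x1 * x1 - x2) ** 2 + (1.0 - x1) ** 2
    return 3905.93 - original
```

The optimum lies at x1 = x2 = 1, where the original function is 0.0 and `f2_max` returns 3905.93. At the
corner x1 = x2 = -2.048 the original function is approximately 3905.93, so the recast evaluation is close to
zero.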




Figure  14:  De  Jong's  F2. 


5.4.2.  De  Jong  F3 

As with the previous problem, De Jong's F2, the only difference between the encoding of De Jong's F3 used
here and its original formulation is that it was originally formulated as a minimization problem. In this
formulation, the problem is a maximization problem. A constant value of 30.0 is added to the function to
place the global minimum at zero. The evaluation of the new F3 is subtracted from the maximum of the
function, 55.0, to make the problem a maximization problem with a minimum of 0.0. The resulting
function is shown below (in non-simplified form). The results are shown in Figure 15.

All of the algorithms performed well on this problem. The hillclimbing algorithm's average evaluation
was 55.0 (the optimal).


$$f(x) = 55.0 - \left( 30.0 + \sum_{i=1}^{5} \lfloor x_i \rfloor \right)$$
$$-5.12 \le x_1, \ldots, x_5 \le 5.12$$
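
The recast F3 can be sketched in the same non-simplified form as the text describes. The sketch assumes the
usual five-variable step function for De Jong's F3 (the sum of the integer parts of the variables); the
function name is illustrative and the binary decoding of the variables is omitted.

```python
import math


def f3_max(xs):
    """De Jong's F3 recast as a maximization problem on -5.12 <= x_i <= 5.12."""
    shifted = 30.0 + sum(math.floor(x) for x in xs)   # minimization form, minimum 0.0
    return 55.0 - shifted                             # maximization form, maximum 55.0
```

With all five variables at -5.12 each floor term is -6, so the shifted function is 0.0 and the evaluation is
the optimal 55.0; with all five at 5.12 the evaluation is 0.0.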




Figure  15:  De  Jong's  F3. 


5.4.3.  Griewangk 

This problem was taken from [Gordon & Whitley, 1993]. The problem is originally attributed to
Griewangk. The original problem was a minimization problem. In order to cast the problem as a
maximization problem, the problem is to maximize the reciprocal of the original function. To ensure that a value of
0.0 does not appear in the denominator, 0.1 is added in the denominator. The function is shown below. The
results are shown in Figure 16.


$$f(x) = \frac{1.0}{0.1 + \ldots}$$

The hillclimbing algorithm was able to achieve an average final score of 10.6223. This is the highest
evaluation achieved on this problem by any algorithm tested in this study.




Figure 16: Function Based Upon Griewangk's function [Gordon & Whitley, 1993].


5.5.  Strong  Local  Optima  Problems 

The last two problems studied in this paper are the order-4 deceptive problem and the checkerboard
problem. The order-4 deceptive problem was originally suggested by Whitley & Starkweather [Whitley &
Starkweather, 1990]. The problem is composed of 10 sub-problems, each of which is composed of 4 bits.
The bits of each sub-problem are maximally distributed through the 40 bit vector. The evaluation of each 4
bit problem is shown below. The objective is to maximize the sum of the evaluations. This is a very hard
problem for GAs to optimize, as the vector of all 0's has an extremely large basin of attraction, while the
vector of all 1's - the optimal solution - has none. However, this problem should be less deceptive for
methods of optimization which very quickly restart search in random locations.

5.5.1.  Order-4  Deceptive 

Each of the 10 subproblems is evaluated as shown in Table I; the bits of each subproblem are maximally
dispersed throughout the entire solution string. The results are shown in Figure 17.

The average evaluation by the hill climbing algorithm was 291.40; this is higher than that of any of the other
algorithms tested. There were approximately 1,458 restarts in the 2000x100 evaluations performed during each
run. The average evaluation of each search, before restarting, was 282.48.

Table I: Evaluations for Order-4 Fully Deceptive Problem

SOLUTION STRING   EVALUATION   SOLUTION STRING   EVALUATION
1111              30           0110              14
0000              28           1001              12
0001              26           1010              10
0010              24           1100              08
0100              22           1110              06
1000              20           1101              04
0011              18           1011              02
0101              16           0111              00

Figure  17:  Tenfold  Order  4  Deceptive  Problems  -  Each  subproblem  is  maximally  interleaved. 
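
The evaluation of the tenfold order-4 deceptive problem can be sketched directly from Table I. The sketch
assumes that "maximally interleaved" means subproblem k uses bit positions k, k+10, k+20 and k+30 of the 40
bit string; that layout and the names below are assumptions made for illustration.

```python
# Evaluations for each 4-bit subproblem, taken from Table I.
TABLE_I = {
    "1111": 30, "0000": 28, "0001": 26, "0010": 24,
    "0100": 22, "1000": 20, "0011": 18, "0101": 16,
    "0110": 14, "1001": 12, "1010": 10, "1100": 8,
    "1110": 6, "1101": 4, "1011": 2, "0111": 0,
}


def order4_deceptive(bits, n_sub=10):
    """Sum the Table I evaluations over the interleaved 4-bit subproblems."""
    total = 0
    for k in range(n_sub):
        # Assumed interleaving: one bit of subproblem k in each quarter of the string.
        sub = "".join(str(bits[k + n_sub * t]) for t in range(4))
        total += TABLE_I[sub]
    return total
```

The vector of all 1's scores 10 x 30 = 300, while the vector of all 0's scores 10 x 28 = 280. Every single
bit flip away from all 0's lowers the score, which is what gives that vector its large basin of attraction.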


5.5.2.  Checkerboard Problem

This problem was originally suggested by [Boyan, 1993]. This problem requires a 400 bit solution string,
which is interpreted as a 20x20 grid. The objective of the problem is to create a checkerboard pattern of 0's
and 1's on the 20x20 grid. Each location with a value of 1 should be surrounded in all four directions by a
value of 0, and vice-versa. Only the four primary directions are considered in the evaluation. The
evaluation is measured by counting the number of correct surrounding bits for the present value of each bit
position for an 18x18 grid, centered in the 20x20 grid. In this manner, the corners are not counted in the
evaluation. There is no need to interpret the grid as a toroid, as only an 18x18 grid is used. The maximum
evaluation is 1296 (18x18x4). In addition to the methods compared in the previous problems, a parallel GA
with 10 populations and 10 members per population was also tested. The results are shown in Figure 18.
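
The checkerboard evaluation described above can be sketched as follows. The row-major layout of the 400
bits and the function name are assumptions made for illustration.

```python
def checkerboard_evaluation(bits, size=20):
    """Count, over the centered inner grid, the neighbors that differ from each cell."""
    grid = [bits[r * size:(r + 1) * size] for r in range(size)]
    score = 0
    for r in range(1, size - 1):              # inner 18x18 cells only
        for c in range(1, size - 1):
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                if grid[r + dr][c + dc] != grid[r][c]:
                    score += 1                # a correct surrounding bit
    return score
```

A perfect checkerboard, for example `[(r + c) % 2 for r in range(20) for c in range(20)]`, scores
18 x 18 x 4 = 1296, and a grid of all 0's scores 0.
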
The hill climber's average evaluation was 11^.

Figure 18: The Checkerboard Problem.


5.6.  Summary of Empirical Results

A summary of the results in this section is presented in Table II; this table shows the problems on which
each method performed the best. Note that all of the PBIL algorithms which used a non-zero positive and
negative learning rate are grouped together. In the cases in which all algorithms achieved nearly identical
performance (De Jong's F2 and F3), the best algorithm was chosen to be the one which found the solution
in the fewest function evaluations.


Table II: Summary of Results

Algorithm                     Number of "Wins"   Problems in which "Wins" occurred
S-PBIL                        0                  N/A
PBIL2 (0.025, 0.075, 0.1)     8                  Jobshop 10x10, Jobshop 20x5, TSP30, TSP50,
                                                 Bin Pack 1, Bin Pack 3, De Jong's F3, Checkerboard
PBIL-N                        0                  N/A
PBIL-LOW                      1                  De Jong's F2
Standard Genetic Algorithm    0                  N/A
Multiple Restart Next Step    3                  Bin Pack 2, Griewangk, Order-4 Deceptive
Hill Climber

There was no clear best within the PBIL2 category. However, on the problems on which PBIL2 performed
the best, a setting of 0.075 for the negative learning rate parameter was able to do better than standard PBIL,
PBIL-N, PBIL-LOW, the SGA and the hill climber. Nonetheless, the optimal setting of the negative learning
rate is very problem dependent. For example, on several problems (Jobshop 10x10, Jobshop 20x5, and the
Checkerboard problem), the performance of PBIL2 was improved with alternate settings of the negative
learning rate.

The results achieved by PBIL are more accurate and are attained faster than those of a conventional genetic
algorithm. For example, the average final result obtained by the SGA for the 20x5 jobshop problem was
attained in generation 1702 (after this generation, no improvement was made); PBIL2 (0.075) attained the
same quality solution at generation 191. This type of speedup is found in many of the problems attempted.
The PBIL2 (0.075) algorithm was able to find equivalent solutions to the SGA, measured over the entire
run, in fewer generations. Table III shows in which generation the SGA was able to first achieve its highest
evaluation, and in which generation PBIL2 (0.075) was first able to surpass it. The table also shows in
which generation the S-PBIL algorithm was able to achieve its highest evaluation, and in which generation
the PBIL2 (0.075) algorithm was able to surpass it. On the De Jong functions (F2 and F3) and the order-4
deceptive problem, all algorithms were able to reach approximately the same solutions.


Table III: Relative Performance of PBIL2 (0.075) compared with SGA and S-PBIL
(all entries are generation numbers)

                 Highest SGA    PBIL2 (0.075)   Highest S-PBIL   PBIL2 (0.075)
                 evaluation     surpassed       evaluation       surpassed
PROBLEM          first found    highest SGA     first found      highest S-PBIL
                                evaluation                       evaluation
Jobshop 10x10    1894           215             1039             353
Jobshop 20x5     1702           191             1271             1015
TSP30            1928           219             1916             1164
TSP50            1954           148             1961             444
Bin Pack 1       1982           1039            1986             784
Bin Pack 2       1991           1870            1988             1954
Bin Pack 3       1949           932             1930             968
Griewangk        1962           160             1977             200
Checkerboard     1985           156             1994             1173


6.  Conclusions 

The simple population-based incremental learner performs better than a standard genetic algorithm in the
problems empirically compared in this paper. The results achieved by PBIL are more accurate and are
attained faster than a standard genetic algorithm, both in terms of the number of evaluations performed
and the clock speed. The gains in clock speed were achieved because of the simplicity of the algorithm.
The algorithm does not require all the mechanisms of a GA; rather, the few steps in the algorithm are small
and simple. For example, the implementation explored for this study - with all of the surveyed extensions
- totaled less than 100 lines of C code.

One explanation for the success of the PBIL algorithm is its ability to focus search effort
in one region of the space very quickly. In comparing the ability of the GA and of PBIL to focus search, the
GA was able to maintain diversity longer than PBIL (for some settings of the learning rate). Nonetheless,
the GA's ability to postpone commitment to one region of the function space does not allow it to
extensively explore any region. When one region is explored extensively, the GA's population loses diversity,
and thereby also loses its ability to explore other regions. The time to make the commitment to one region
is longer in GAs than in PBIL. However, the empirical results show that PBIL does enough exploration
before the commitment is made to outperform a conventional GA. Therefore, this postponement of
commitment in the GA, at least on the problems attempted here, is not necessary. This line of explanation is
under study.

Another benefit over traditional genetic algorithms is that PBIL is easily parallelizable. Parallelization of
this algorithm on both coarse and fine grain parallel machines is simply implemented. Although GAs have
also been parallelized on both types of machines, many fine grain parallelization methods incur a large
amount of inter-process communication, which can severely affect the time performance of the algorithm.
For example, unless the GA incorporates a subpopulation structure, a single synchronizing processor must
handle the pairing of the solution strings, and control the distribution of the solution strings from their
original processors to their new processors. Although a subpopulation model will ease the communication
overhead, both algorithms, PBIL and the subpopulation GA, can benefit from the reduced communication
costs in this model. In the parallelization of a single population PBIL, the inter-processor communication is
minimal, as only the highest and lowest evaluation vectors generated are returned to the synchronizing
processor. The updated probability vector is then sent to all of the processors. All of the surveyed
modifications to the original PBIL algorithm also incur very little overhead. In terms of the number of function
evaluations, the modifications are free. Computationally, the additions do not add noticeable time
penalties.

It is clear from the results that using negative examples helps to achieve better results. However, there is
no clear winner regarding the setting of the negative learning rate. The best setting of the negative learning
rate varies per problem. A much more focused study should be done on each of the parameters. Particular
attention was given to the negative learning rate in this paper, as this is apparently the parameter which
has the least amount of literature available.

The empirical results presented here indicate that where a standard genetic algorithm does well, this
algorithm should also do as well or better. However, on problems on which genetic algorithms fail, this
technique may also perform badly. Although the mechanisms of PBIL are different than those usually
employed in GA literature, standard PBIL may still be susceptible to some of the suitability issues
associated with genetic optimization. For example, the bin packing problems have good approximation
algorithms [Garey and Johnson, 1979], and in fact good solutions can be found in many cases with
hill-climbing or simple depth-first search in reasonable amounts of time. This leads to questions about whether
the genetic optimization schemes are appropriate for this class of problems. This paper does not address
this issue, as it has been a topic of study for many researchers in this area. The scope of this paper is limited
to improving the optimization capabilities of GAs, not selecting the problems to which GAs should be
applied.

Although the algorithm presented here is a significant departure from standard genetic algorithms and
competitive learning, both of the fields have contributed enormously to the final algorithm. The
competitive learning facet of the algorithm was substantially modified from the majority of CL algorithms and
applications found in current literature. First, the aim of the algorithm was to find a prototype vector for
the class of high evaluation vectors, which is a supervised task. Second, the training of the CL-net also
directed the progression of which points would be seen by the CL-net; usually the training of the CL-net
does not affect the points which are to be used in the training. Finally, the class of interest, the high
evaluation vectors, was not pre-defined. Rather, the class was defined relative to the currently generated
population of points. The highest evaluation vector in the current population was defined to be in the class of
interest.

Basic genetic algorithm features such as a population and the crossover operator were implemented in
quite a different manner than the simple GA envisioned by Holland and De Jong [Holland, 1975] [De Jong,
1975]. A very basic intuition of the progress made by a GA, and its counterpart in PBIL, was briefly given
in this paper. One of the difficulties in standard genetic algorithms which this algorithm eliminates is the
issue of scaling. In the beginning of this paper, it was mentioned that a GA needs the function to be
appropriately scaled to perform accurate optimization. This algorithm does not rely on any specific form of
scaling; it only requires that the scaling is increasing for increasing fitness. The scaling is inherently relative,
and only requires the best and worst evaluations in a population. Rank-based GAs have also been
described in literature; this approach may yield improved performance of the GA on the test problems.
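
The independence from scaling can be illustrated with a small sketch. The names `best_and_worst`, `raw`, and `rescaled` are hypothetical; the point is only that any strictly increasing transform of the evaluation function selects the same best and worst vectors, and therefore drives the same update.

```python
import math

def best_and_worst(population, evaluate):
    # PBIL needs only the identities of the extremes, not their magnitudes.
    ranked = sorted(population, key=evaluate)
    return ranked[-1], ranked[0]

def raw(v):
    # Toy evaluation (an assumption): number of 1 bits.
    return sum(v)

def rescaled(v):
    # A strictly increasing transform of raw(): same ordering, very different magnitudes.
    return math.exp(3 * raw(v)) - 100.0

population = [[0, 0, 1], [1, 1, 1], [0, 1, 1], [0, 0, 0]]
same = best_and_worst(population, raw) == best_and_worst(population, rescaled)
```

A fitness-proportional GA, by contrast, would assign very different selection probabilities under `raw` and `rescaled`.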

The basic algorithm, as it is presented in this paper, is very simple. In the same way that GAs have enjoyed
a great deal of research increasing their effectiveness for a variety of interest-specific goals, similar
additions can be made for PBIL. In this paper, only a standard GA was considered. A GA with different
mechanisms, such as non-stationary mutation rates, local optimization heuristics, parallel subpopulations,
specialized crossover, or larger alphabets, may perform better. However, all of these mechanisms, with the
exception of specialized crossover operators, can be used with PBIL with no modifications; many have
been implemented with very promising results. The claim is not that the basic PBIL will be able to
outperform all GAs; more work needs to be conducted to determine where each method will prove to be more
beneficial. The hope is that the simple framework of PBIL will be used as an underlying model from which
extensions and insights can be formed.


7.  Future  Directions 

7.1. Non-Binary Encodings

Although the theoretical studies of GAs suggest using a low cardinality encoding for the solution strings,
practice has often revealed good solutions through more natural encodings. In a higher cardinality set, the
probability of each position can be represented by a set of probabilities for generating each possible
member at each location in the solution vector.
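
A minimal sketch of such a representation is given below. It assumes one row of symbol probabilities per position and extends the binary update rule to a one-hot target; that extension, and the names `sample_categorical` and `update_categorical`, are assumptions made for illustration rather than a method evaluated in this paper.

```python
import random

def sample_categorical(prob_table, rng):
    # prob_table[i][k] is the probability of generating symbol k at position i.
    return [rng.choices(range(len(row)), weights=row)[0] for row in prob_table]

def update_categorical(prob_table, best, lr=0.1):
    # Shift each row toward a one-hot encoding of the best vector's symbol.
    # Each row still sums to 1.0 after the update.
    return [
        [p * (1.0 - lr) + (lr if k == symbol else 0.0) for k, p in enumerate(row)]
        for row, symbol in zip(prob_table, best)
    ]

rng = random.Random(0)
table = [[0.25] * 4 for _ in range(3)]   # 3 positions, alphabet of size 4
solution = sample_categorical(table, rng)
table = update_categorical(table, best=[2, 0, 3])
```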

Many of the problems attacked in genetic algorithm literature are continuous valued optimization
functions which have been discretized to an arbitrary value. A possible method of generating continuous
values is to specify a Gaussian curve over the desired range. The average and the variance for each variable
can be "evolved" during the search. Methods of encoding solutions which are similar to this are often used
in evolutionary strategies and evolutionary programming [Baeck & Schwefel, 1993].
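
One way this idea might look in code is sketched below. The paper does not specify how the average and variance would be adapted, so the update rule here (means move toward the best sample, spreads track the distance between the old mean and the best sample) is purely an assumption, as are the names `sample_gaussian`, `update_gaussian`, and the toy function `sphere`.

```python
import random

def sample_gaussian(means, sigmas, low, high, rng):
    # Draw each variable from its own Gaussian, clipped to the allowed range.
    return [min(high, max(low, rng.gauss(m, s))) for m, s in zip(means, sigmas)]

def update_gaussian(means, sigmas, best, lr=0.1, min_sigma=1e-3):
    # Illustrative rule: means move toward the best sample; each spread moves
    # toward the distance between the old mean and the best sample.
    new_means = [m * (1.0 - lr) + b * lr for m, b in zip(means, best)]
    new_sigmas = [
        max(min_sigma, s * (1.0 - lr) + abs(b - m) * lr)
        for s, m, b in zip(sigmas, means, best)
    ]
    return new_means, new_sigmas

def sphere(x):
    # Toy evaluation (an assumption): maximized at the origin.
    return -sum(v * v for v in x)

rng = random.Random(1)
means, sigmas = [3.0, -2.0], [1.0, 1.0]
for _ in range(200):
    samples = [sample_gaussian(means, sigmas, -5.0, 5.0, rng) for _ in range(20)]
    means, sigmas = update_gaussian(means, sigmas, max(samples, key=sphere))
```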


7.2.  Drawing  from  Genetic  Algorithm  Literature 


7.2.1. Multiple Populations

In genetic algorithms research, there is a growing trend towards using parallel genetic algorithms. As
mentioned before, these algorithms explicitly maintain parallelism. Similar techniques are possible in
PBIL. Multiple prototype vectors can be formed in parallel, each generating its own set of potential
solutions. In order to synthesize the search in each of the populations, positions in the probability vectors can
be swapped [Juels, 1993][Gordon, 1993]. Preliminary experiments have been tried, and have revealed
improved results. Although this superficially appears to be more closely related to the LVQ algorithm as
multiple prototype vectors are developed, this is not the case. Each prototype vector is not trying to
distinguish unique classes; rather, all of the vectors are used to define only the single class of interest.

7.2.2. Time Varying Mutation Rates

One of the current areas of research in the GA community is the use of adaptive, or time-varying, mutation
rates. These provide the ability to control the amount of mutation based upon the convergence of the
population. The need for mutation is largest when the population has converged. Convergence in the PBIL can
be measured by the similarity of the vectors generated, or directly by the values of the prototype vector.
Mutation rates can be increased in the latter portions of search when the vectors generated may be too
homogeneous. As in GAs, this may aid in escaping local optima by maintaining a heterogeneous population.
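
A minimal sketch of measuring convergence directly from the prototype vector follows. The particular measure (mean distance from 0.5) and the linear interpolation between a base and a ceiling mutation probability are assumptions chosen for illustration; `convergence` and `mutation_probability` are hypothetical names.

```python
def convergence(prob_vector):
    # 0.0 when every probability is 0.5, 1.0 when all are pinned at 0.0 or 1.0.
    return sum(abs(p - 0.5) for p in prob_vector) / (0.5 * len(prob_vector))

def mutation_probability(prob_vector, base=0.02, ceiling=0.2):
    # Raise the mutation probability as the prototype vector converges.
    return base + (ceiling - base) * convergence(prob_vector)

early = mutation_probability([0.5, 0.55, 0.45, 0.5])
late = mutation_probability([0.98, 0.02, 0.99, 0.01])
```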

7.2.3. Intelligent Mutation

Currently, mutation is implemented by selecting a position to mutate, then moving the value in the
probability vector a specified distance in a random direction (either towards 0.0 or 1.0). As the primary goal of
mutation is to preserve diversity, perhaps a more intelligent method of using mutation is to move the
selected position towards a less committed state: 0.5. This ensures that the mutation increases the amount
of diversity in the next population.
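
The proposed variant might be sketched as below. The function name `mutate_toward_half` and its default parameters are hypothetical; the step is capped so that a mutated value never overshoots 0.5, which is a design choice of this sketch rather than something stated in the text.

```python
import random

def mutate_toward_half(prob_vector, mut_prob=0.02, distance=0.05, rng=random):
    mutated = []
    for p in prob_vector:
        # Each position is selected for mutation with probability mut_prob.
        if rng.random() < mut_prob:
            step = min(distance, abs(0.5 - p))   # never overshoot 0.5
            p = p + step if p < 0.5 else p - step
        mutated.append(p)
    return mutated

shifted = mutate_toward_half([0.9, 0.1, 0.52], mut_prob=1.0, distance=0.05)
```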

7.2.4. Elitist Selection

In standard genetic algorithms, there is no guarantee that once a good solution vector is found that it will
remain in the population of the subsequent generation. For example, it may not be selected for
recombination, as selection is probabilistic, or it may be altered by mutation and/or crossover. A method used to
resolve this is to carry the best solution vector from one generation to the next unaltered. Similarly, in the
PBIL algorithm, it is possible that although a good solution vector is found in one generation, it may not be
found again in the next, as the generation of solution vectors is probabilistic. In a manner similar to elitist
selection, the best solution vector from the previous generation can be placed into the current generation.
The solution vector will only be chosen again for updating the probability vector if no better solution is
found.
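
A sketch of this elitist step for PBIL is shown below, assuming a hypothetical helper `generation_with_elite` that samples a population, appends the previous best vector, and returns the vector to be used for the update.

```python
import random

def generation_with_elite(prob_vector, evaluate, n_samples, elite, rng):
    # Sample a new population, then place the previous best vector into it.
    population = [[1 if rng.random() < p else 0 for p in prob_vector]
                  for _ in range(n_samples)]
    if elite is not None:
        population.append(elite)
    # max() returns the first maximal element, so the elite (placed last) is
    # chosen only when it strictly beats every newly sampled vector.
    return max(population, key=evaluate)

rng = random.Random(0)
kept = generation_with_elite([0.0] * 8, sum, 5, elite=[1] * 8, rng=rng)
replaced = generation_with_elite([1.0] * 8, sum, 5, elite=[0] * 8, rng=rng)
```

Breaking ties in favor of newly sampled vectors is a choice made in this sketch; the text only requires that the elite is reused when no better solution is found.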


7.3. Drawing from Artificial Neural Network (ANN) Literature


7.3.1.  Competition  of  Probability  Vectors 

This paper has presented a method for replacing the population of a GA by a single probability vector. The
update rule for the probability vector is derived from competitive learning. However, it may be possible to
derive more than the update rule from the principles of competitive learning. Multiple probability vectors
can be used to model a population. The next generation can be created by sampling each probability vector
for a subset of the next population. Multiple good solution vectors can be used to update the probability
vectors. Each good solution vector can update the probability vector to which it is most similar, as is done
in standard competitive learning. This method is also currently under study by [Baluja & Boyan, 1994].
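
The winner-take-all assignment described above might be sketched as follows. Squared Euclidean distance as the similarity measure, and the names `nearest_index` and `competitive_update`, are assumptions for illustration; the method under study in [Baluja & Boyan, 1994] may differ.

```python
def nearest_index(prob_vectors, solution):
    # Index of the probability vector closest to the solution
    # (squared Euclidean distance, an assumed similarity measure).
    def dist(pv):
        return sum((p - s) ** 2 for p, s in zip(pv, solution))
    return min(range(len(prob_vectors)), key=lambda i: dist(prob_vectors[i]))

def competitive_update(prob_vectors, good_solutions, lr=0.1):
    # Each good solution updates only the probability vector it is most similar to.
    updated = [list(pv) for pv in prob_vectors]
    for solution in good_solutions:
        i = nearest_index(updated, solution)
        updated[i] = [p * (1.0 - lr) + s * lr for p, s in zip(updated[i], solution)]
    return updated

vectors = competitive_update(
    [[0.8, 0.8, 0.2], [0.2, 0.2, 0.8]],
    good_solutions=[[1, 1, 0], [0, 0, 1]],
)
```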

7.3.2.  Adapting  the  Learning  Rate 

A method that has been proposed in the ANN literature to ensure that the prototype vectors in a
competitive learning network stabilize is to lower the learning rate slowly to 0.0. However, non-convergence has
not been a problem in this algorithm. Rather, a large problem is avoiding premature convergence. As the
learning rate controls the amount of exploration vs. the amount of exploitation the algorithm is
performing, adjusting the learning rate during the course of a training session may provide increased ability to
explore diverse regions of the function space.

7.3.3. Explicitly Avoiding Premature Convergence of Probabilities

One of the methods suggested by Fahlman to aid in the training of artificial neural networks is to ensure
that the outputs of the neurons in an ANN are not pinned to extreme values [Fahlman, 1989]. This is not an
immediate problem in PBIL, as the update rule ensures that when the values of a probability begin to reach
an extreme, a move away from the extreme is larger than a move towards the extreme. However, only
allowing the probabilities to come within a pre-specified distance of the extremes, as Fahlman [Fahlman,
1989] has done, may also aid in avoiding local optima.
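
Both points can be checked numerically. The sketch below assumes the usual PBIL move of a probability toward a target bit and a hypothetical `clamp` helper with an illustrative margin; with a learning rate of 0.1 and a probability of 0.9, a move toward 1.0 changes the value by about 0.01, while a move toward 0.0 changes it by about 0.09.

```python
def step(p, target, lr=0.1):
    # Move one probability toward a target bit (0 or 1).
    return p * (1.0 - lr) + target * lr

def clamp(prob_vector, margin=0.05):
    # Keep every probability at least `margin` away from 0.0 and 1.0.
    return [min(1.0 - margin, max(margin, p)) for p in prob_vector]

p = 0.9
toward = step(p, 1) - p     # change when moving toward the nearby extreme
away = p - step(p, 0)       # change when moving away from it
bounded = clamp([0.0, 0.5, 0.99])
```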

7.3.4.  Using  Noise  to  Avoid  Premature  Convergence  of  Probabilities 

A method used in traditional competitive learning to ensure that all output units are eventually activated
is to add noise to the pattern vectors [Hertz, Krogh & Palmer, 1993]. In PBIL, once the highest evaluation
vector is determined in the current generation, a small amount of noise can be added to the vector before it
is used to update the prototype vector. In the few problems tested with this technique, the results were
improved over optimization without noise. Nonetheless, the effectiveness of this measure must be more
carefully studied before firm conclusions can be made.
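
The text does not say what form the noise took. The sketch below assumes zero-mean Gaussian noise added to each bit of the best vector, clipped to [0, 1]; the name `noisy_update` and the noise level are hypothetical.

```python
import random

def noisy_update(prob_vector, best, lr=0.1, noise=0.05, rng=random):
    # Perturb each bit of the best vector with Gaussian noise, clip to [0, 1],
    # then apply the usual move toward the (now real-valued) target.
    noisy = [min(1.0, max(0.0, b + rng.gauss(0.0, noise))) for b in best]
    return [p * (1.0 - lr) + t * lr for p, t in zip(prob_vector, noisy)]

updated = noisy_update([0.5] * 4, [1, 0, 1, 0], rng=random.Random(0))
```

With `noise=0.0` the function reduces to the plain update, which makes it easy to compare runs with and without noise.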


7.4.  Choosing  the  Appropriate  Problems 

One of the future directions which should immediately be considered is not only relevant to PBIL, but also
to the field of genetic algorithms. The problems chosen here were static function optimization problems.
They were chosen because either the exact same problems or problems very similar to these appear in a
large amount of GA research. Although much of the GA research has been limited to the type of static
function optimization presented here, simpler heuristics, ranging from PBIL to restart hill-climbing
methods, have been shown empirically to be as effective as genetic algorithms. For example, in this paper alone,
on three of the twelve problems attempted (Bin Packing 2, Griewangk, and Order-4b) the hill-climbing
procedure performs the best. Further, if the hill-climbing algorithm were a bit more sophisticated and included
moves to regions of equal evaluation (this would also necessitate cycle detection for restart),
the hill-climbing performance might be improved on some of the problems attempted here. This reflects
the tremendous need to more carefully study and define the domains in which GAs, and evolutionary
algorithms in general, yield benefit, in light of their very high computational burden.


10. Appendix A

The typical crossover operators which are used in standard genetic algorithms are described here. These
operators are most commonly implemented with GAs which have fixed length potential solution
representations. Other crossover operators have been developed for problems in which the solution
representations may not be fixed in length, such as in the task of genetic programming [Koza, 1992]. Other operators
have also been developed to take advantage of specific characteristics of the problem being addressed. The
three crossover operators described here are general purpose recombination operators which have been
used and studied widely in genetic algorithm literature.

One Point Crossover

Given  two  parent  chromosomes  A  &  B,  select  a  randomly  chosen  crossover  point,  swap  contents  of  the 
chromosomes  beyond  the  chosen  point: 

Parent  A  00000000000  00000 

Parent  B  11111111111  11111 

Child  A  00000000000  11111 

Child  B  11111111111  00000 

Two Point Crossover

Given  two  parent  chromosomes  A  &  B,  select  two  randomly  chosen  crossover  points,  swap  contents  of  the 
chromosomes  between  chosen  points: 

Parent  A  000  0000000000  000 

Parent  B  111  1111111111  111 

Child  A  000  1111111111  000 

Child  B  111  0000000000  111 

Uniform  Crossover 

Given  two  parent  chromosomes  A  &  B,  select  from  either  parent  randomly: 

Parent  A  0000000000000000 

Parent  B  1111111111111111 

Child  A  0101011010111000 

Child  B  1010100101000111 
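
The three operators described in this appendix can be sketched in Python as follows. Chromosomes are represented as strings of '0' and '1' for readability; the function names and the use of a seeded `random.Random` instance are choices made for this illustration.

```python
import random

def one_point(a, b, rng):
    # Swap everything beyond a single random crossover point.
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def two_point(a, b, rng):
    # Swap the segment between two distinct random crossover points.
    lo, hi = sorted(rng.sample(range(1, len(a)), 2))
    return a[:lo] + b[lo:hi] + a[hi:], b[:lo] + a[lo:hi] + b[hi:]

def uniform(a, b, rng):
    # Choose each position from either parent with equal probability;
    # the second child receives the value the first child did not take.
    picks = [rng.random() < 0.5 for _ in a]
    child_a = "".join(y if pick else x for x, y, pick in zip(a, b, picks))
    child_b = "".join(x if pick else y for x, y, pick in zip(a, b, picks))
    return child_a, child_b

rng = random.Random(0)
parent_a, parent_b = "0" * 16, "1" * 16
children = [op(parent_a, parent_b, rng) for op in (one_point, two_point, uniform)]
```

With all-zero and all-one parents, as in the examples above, the two children of every operator are bitwise complements of each other.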


11.  Appendix  B:  The  Hillclimbing  Algorithm 

In order to judge the performance of the GA and PBIL algorithms, the problems are also attempted using
hill-climbing methods. The hill-climbing algorithm is tested on the same problem representations as were
used in the GA and PBIL algorithms. The algorithm is shown in Figure 19.

    L ← clear list
    V ← randomly generate solution vector
    Best ← evaluate (V)

    loop # ITERATIONS
        pos ← random integer (1..LENGTH) ∉ L
        V_pos ← Flip_Bit (V_pos)
        V_eval ← evaluate (V)
        if (V_eval > Best)
            Best ← V_eval
            L ← clear list
        else
            V_pos ← Flip_Bit (V_pos)
            L ← add pos to L

    USER DEFINED CONSTANTS:
    LENGTH: the length of the solution encoding (in bits)
    ITERATIONS: how many evaluations are to take place throughout the run of the algorithm

    VARIABLES:
    V: the current solution vector
    V_eval: the current evaluation.
    Best: the best evaluation ever found.
    L: the list of moves attempted which resulted in evaluations ≤ Best.

Figure 19: The hillclimbing algorithm used. It is shown for solution vectors represented in binary.
In the full algorithm, the best vector along with its evaluation would be saved.
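
A runnable Python sketch of the procedure in Figure 19 follows. The figure does not show what happens once every position is in the list L; this sketch simply stops at that point, which is an assumption (a restart variant would instead draw a new random vector). The function name `hillclimb` and the toy evaluation used in the example are illustrative.

```python
import random

def hillclimb(evaluate, length, iterations, rng):
    v = [rng.randint(0, 1) for _ in range(length)]
    best = evaluate(v)
    tried = set()                       # plays the role of the list L
    for _ in range(iterations):
        untried = [i for i in range(length) if i not in tried]
        if not untried:
            break                       # assumption: stop when no move is left
        pos = rng.choice(untried)
        v[pos] ^= 1                     # flip the chosen bit
        v_eval = evaluate(v)
        if v_eval > best:
            best = v_eval
            tried.clear()
        else:
            v[pos] ^= 1                 # undo the flip
            tried.add(pos)
    # Flips are kept only when they improve, so v is always the best vector seen.
    return v, best

vector, score = hillclimb(sum, length=20, iterations=500, rng=random.Random(0))
```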


12.  Appendix  C:  A  Simple  Genetic  Algorithm 

This appendix gives a small outline of a basic genetic algorithm. Of course, many variations are possible,
and have been explored in literature. The GA shown in Figure 20 is generational, and every solution vector
in the previous generation is replaced by new solution vectors. The GA also uses a modest form of elitist
selection, in which the best solution vector from the previous generation replaces the worst solution vector
in the current generation.

    i ← loop # POPULATION_SIZE
        PopulationA_i ← randomly generate a solution vector
        EvaluationA_i ← evaluate (PopulationA_i)
    best ← find highest evaluation in EvaluationA

    loop # GENERATIONS
        loop # POPULATION_SIZE/2
            one ← select vector probabilistically, based upon evaluation
            two ← select vector probabilistically, based upon evaluation
            child1, child2 ← generate two children,
                             based on crossover of (PopulationA_one, PopulationA_two)
            child1 ← perform mutation (child1, MUTATION_RATE)
            child2 ← perform mutation (child2, MUTATION_RATE)
            PopulationB ← add child1 to PopulationB
            PopulationB ← add child2 to PopulationB

        i ← loop # POPULATION_SIZE
            EvaluationB_i ← evaluate (PopulationB_i)

        worst ← find vector corresponding to the worst evaluation in PopulationB
        PopulationB_worst ← PopulationA_best

        PopulationA ← PopulationB
        EvaluationA ← EvaluationB
        best ← find highest in PopulationA

    USER DEFINED CONSTANTS:
    GENERATIONS: the number of generations the algorithm is allowed to continue (2000)
    POPULATION_SIZE: the number of samples generated per generation (100)
    MUTATION_RATE: the probability of flipping each bit (0.001)

    IMPLICIT DECISIONS (CONSTANTS NOT SHOWN HERE):
    CROSSOVER RATE: how many vectors to replace in subsequent generations (every solution vector
    is replaced)
    CROSSOVER TYPE: 2 pt, 1 pt, Uniform, Specialized, etc. (2 pt)
    ELITIST-SAVES: how many solution vectors are saved from the previous generation (1)
    LENGTH: the length of the solution encoding (problem specific)

    VARIABLES:
    PopulationA, PopulationB: arrays of solution vectors, each of size POPULATION_SIZE
    EvaluationA, EvaluationB: arrays of evaluations of the solution vectors in PopulationA and
    PopulationB, respectively
    child1, child2: two solution vectors produced by the crossover operations
    best, worst: the best and worst in a population, based upon the vector's evaluation

Figure 20: The basic GA algorithm used in this study. The numbers in parentheses are the settings
used for the empirical studies in this paper.
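
A compact Python sketch of the outline in Figure 20 is given below. The figure says only that parents are selected "probabilistically, based upon evaluation"; fitness-proportional selection over non-negative evaluations is assumed here, and the function name `simple_ga` is hypothetical. Two-point crossover, per-bit mutation, and a single elitist save follow the settings listed in the figure. The sketch assumes an even population size and a length of at least 3.

```python
import random

def simple_ga(evaluate, length, pop_size=100, generations=2000,
              mutation_rate=0.001, rng=None):
    rng = rng or random.Random()
    pop_a = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    eval_a = [evaluate(v) for v in pop_a]
    for _ in range(generations):
        best = max(range(len(pop_a)), key=eval_a.__getitem__)
        # Assumption: fitness-proportional selection, non-negative evaluations.
        weights = [e + 1e-9 for e in eval_a]
        pop_b = []
        for _ in range(pop_size // 2):
            one, two = rng.choices(pop_a, weights=weights, k=2)
            lo, hi = sorted(rng.sample(range(1, length), 2))  # two-point crossover
            for child in (one[:lo] + two[lo:hi] + one[hi:],
                          two[:lo] + one[lo:hi] + two[hi:]):
                # Flip each bit independently with probability mutation_rate.
                pop_b.append([b ^ 1 if rng.random() < mutation_rate else b
                              for b in child])
        eval_b = [evaluate(v) for v in pop_b]
        # Elitism: the previous best replaces the worst new vector.
        worst = min(range(len(pop_b)), key=eval_b.__getitem__)
        pop_b[worst], eval_b[worst] = list(pop_a[best]), eval_a[best]
        pop_a, eval_a = pop_b, eval_b
    best = max(range(len(pop_a)), key=eval_a.__getitem__)
    return pop_a[best], eval_a[best]

vector, score = simple_ga(sum, length=12, pop_size=20, generations=30,
                          rng=random.Random(0))
```

Because of the elitist save, the best evaluation in the population never decreases from one generation to the next.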